ABSTRACT

Exploratory data analysis problems have recently grown in importance due to 
the large magnitudes of data being collected by everything from satellites to supermarket 
scanners. This so-called “data glut” often precludes the effective processing of 
information for decision-making. These problems can be seen as search problems over 
massive unstructured spaces. A prototypical problem of this type involves the search, by 
Department of Defense medical agencies, for a so-called “Desert Storm Syndrome,”
which involves large amounts of medical data obtained over several years following the 
Persian Gulf conflict. This data ranges over more than 170 attributes, making the search 
problem over the attribute space a hard one. We propose the use of genetic algorithms for 
the attribute search problem, and intertwine it with search algorithms at the detailed data 
level. Computational results so far strongly suggest that our system has succeeded at the 
given tasks, requiring relatively few resources. They also have found no indication that a 
single syndrome or other medical entity is responsible for wide-spread adverse health 
ramifications among a significant cross-section of Persian Gulf War participants in the 
CCEP program. There are, however, numerous correlations of exposure/demographic 
information and associated symptoms/diagnoses which suggest that smaller groups may 
share common health conditions based on shared exposure to common health risk factors. 
I. INTRODUCTION



A. ANALYSIS OF LARGE DATABASES 

Twenty years ago, computers were relatively scarce and applied to limited, highly
specialized applications. At that time, there were rarely enough computerized data to make them
an integral part of any organization’s decision-making process. As technology approached the
present day, automated information systems became more capable and more involved in daily
life. They began capturing more and more data, allowing the computer to become an active
participant in expanding facets of daily decision-making. The exponentially increasing volume of
available data has transformed the decision challenge from one of “data starvation” to “data
saturation.” Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Fayyad, et al., 1996, pp. xv-xvi)
attribute this “mountain of stored data” to such factors as advances in scientific data
collection, introduction of bar codes, and the computerization of many business and government
transactions. In many situations today, there is so much data that human beings are unable to
correlate it all, and decision quality is again hampered, or in the words of John Naisbitt (Fayyad,
et al., 1996, p. xv), “We are drowning in information, but starving for knowledge.”

Clearly there is a growing need for “intelligent agents,” or automated information
systems that can sift through these mountains of data (which other systems have efficiently
collected) and integrate these sources into concise, usable knowledge for use in human
decision-making. It is doubtful that a computer can reproduce the innovative creativity of a human
analyst, but a computer system can be imparted with a basic representation of some of what the
human analyst desires. This representation of interest is then used to filter vast volumes of
available data (a task too time consuming for humans) and present the human analyst with a
more concise body of knowledge in an understandable form. This premise is supported by many
documents, such as this quote from Fayyad, et al.:

Such volumes of data clearly overwhelm the traditional manual methods of data 
analysis such as spreadsheets and ad-hoc queries. Those methods can create 
informative reports from data, but cannot analyze the contents of those reports 
to focus on important knowledge. A significant need exists for a new generation 
of techniques and tools with the ability to intelligently and automatically assist 
humans in analyzing the mountains of data for nuggets of useful knowledge.

These techniques and tools are the subject of the emerging field of knowledge 
discovery in databases (KDD). (Fayyad, et al., 1996, p. 2)

The Comprehensive Clinical Evaluation Program (CCEP) database presents this type of
challenge to data analysis. The CCEP database contains vast amounts of information on over
19,000 Persian Gulf War (PGW) veterans who have brought some form of health concern to the
attention of the Department of Defense (DoD) military healthcare system. The database contains
a large number of attributes, and there are still no defined parameters for search. In any case,
because of problem structure and sheer size, the entire database cannot be comprehensively
analyzed by conventional means. The goal of this thesis is to design, construct, and implement
an artificially intelligent computer system which can analyze the CCEP database more efficiently
than a conventional or “brute force” approach without unduly taxing scarce medical research
assets. Such computer systems are said to carry out “data mining.”



B. PURPOSE OF THIS RESEARCH 

The ultimate purpose of this research is to provide the CCEP program with a viable
methodology to obtain useful information from its database of participating PGW veterans.
Determining what constitutes “useful” or “interesting” information is at least as great a challenge
as devising an analysis tool. However, in the initial stages of medical research, interesting
information is any statistical association between database attributes of different categorical
groups. These associations may signal the existence of an undiscovered common ailment or
“syndrome” affecting participants in the Persian Gulf War.

Time and other resources are also key factors in the overall CCEP research project.
Simply investigating every possible combination of attributes may be theoretically feasible, but
in actuality often necessitates an impractically large commitment of resources to the analysis
task. Therefore, investigative speed and efficiency have become key factors in this research. The
need for speed and efficiency demands that this research develop an intelligent search device
capable of sifting through vast amounts of raw data and identifying interesting trends or
correlations without the need for human intervention. Consequently, a genetic algorithm has
been selected. No commercial product suited our particular needs, so the purpose of this research
includes the development and application of a genetic algorithm suited to analysis of medical
data, specifically the CCEP database.

Finally, this research evaluated the success of the new genetic algorithm (DaMI, the NFS
Data Miner) from several aspects:

• DaMI performance adheres to classical genetic algorithm theory 

• DaMI statistical computations are valid and reproducible 

• DaMI efficiently and comprehensively analyzes the search space

• Outcome hypotheses are of significant value to medical experts and the program 
sponsor 

As with problem structuring, validation of results has proven to be a major research challenge
and is addressed in this paper. 

Computational results so far strongly suggest that our system has succeeded at the given
tasks, requiring relatively few resources. They also have found no indication that a single
syndrome or other medical entity is responsible for wide-spread adverse health ramifications
among a significant cross-section of Persian Gulf War participants in the CCEP program. There
are, however, numerous correlations of exposure/demographic information and associated
symptoms/diagnoses which suggest that smaller groups may share common health conditions
based on shared exposure to common health risk factors.



C. SCOPE OF RESEARCH 

This research examines the problem structuring challenges for analyzing the data
contained in the CCEP database. It discusses the general qualities of genetic algorithms and the
specific techniques used to apply a genetic algorithm to the study of the CCEP database. The
research focuses on application of a genetic algorithm to a relevant real-world problem and does
not contain an in-depth description of genetic algorithm theory. An original genetic algorithm
(DaMI) was created by this research effort. A technical description of the DaMI algorithm, its
development process, and evaluation methodology are included. It is not the purpose of this
research to survey all possible solutions to the CCEP analysis challenge, but rather to completely
examine and document one apparently successful solution. Finally, the results of the DaMI
analysis of the CCEP database are presented along with the validation process and
recommendations for further research. The following research questions were addressed:

• If there is a (actually there may be more than one) common ailment or “syndrome”
afflicting veterans of the Persian Gulf War, how will it manifest itself within the
scope of information gathered by the CCEP database?

• How will the subjective concept of interesting information (to the medical 
community) be quantitatively measured and used to compare the “fitness” of 
different hypotheses? 

• How should the research problem and database be structured to facilitate automated
analysis?

• Why is a genetic algorithm a more effective means of analyzing the CCEP search
space than other more conventional methods?

• How was DaMI constructed? What were the design considerations and key
innovations in this particular genetic algorithm?

• What analyses were conducted and what were the results?

• Were the results validated and were they useful to the project sponsor (CCEP,
Deployment Surveillance Team) and CCEP medical researchers?



D. REAL WORLD APPLICABILITY 

A great deal of research has been performed on genetic algorithms and related artificial
intelligence-based research tools. In many cases the data analyzed were real, but in few cases was
the research tied to a real-world, time-sensitive research problem. One of the primary reasons
for using a genetic algorithm is that an answer is needed, but conventional research resources are
not available to produce that answer within the allotted time. This makes a study of a real-world
genetic algorithm development all the more interesting. The CCEP database research is highly
visible, relevant, and time-sensitive.



Only a select number of medical issues have received as much attention as the proverbial
“Desert Storm Syndrome” in recent years. Since the first returning Persian Gulf War (PGW)
veterans began reporting health issues, this subject has received constant attention by the U.S.
government, military medical researchers, and most prolifically the media. A Presidential
commission has been appointed to determine what, if any, health ailments may be attributed to
the service of U.S. armed forces in the Persian Gulf. Research efforts continue at many DoD and
Veterans Administration (VA) facilities. It is certainly appropriate to say that the CCEP is “high
visibility.”

Similarly, the concept of relating diseases to groups of humans with similar symptoms
and life experiences (demographics and exposure to physical objects) has been a focus of medical
research for many years. Some of the earliest genetic algorithm experiments attempted to relate
symptoms to diagnoses. Medical science has consistently searched for better ways to answer the
question, “What caused this disease?” In the case of CCEP, 697,000 veterans (not to mention
their families) are eager to know if their service in the PGW increases their susceptibility to any
type of medical malady. From an academic perspective, the issue of automatically identifying
“interesting” information has become increasingly fascinating and challenging. Technology has
increased researchers’ ability to automate aspects of a medical situation, but the problem of
making a model that accurately reflects the information remains.

E. THESIS METHODOLOGY AND ORGANIZATION 

This research begins with examination of the CCEP research challenge as a whole. The
first challenge is to structure the CCEP research question of what is an “interesting” hypothesis
into a mathematical formula (fitness function). This function in turn returns a higher “fitness” to
hypotheses of greater interest to CCEP medical researchers. Our research tried many
alternatives, but settled on the use of the Modified J-measure (described in Section II.E.4.c) to
assess relative independence between premise and outcome variables. The CCEP database was
not designed with medical research in mind, so the second challenge was to reformat the database
into a structure which supported automated analysis.
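
As background for the fitness function, the sketch below computes the standard J-measure of Smyth and Goodman for a rule of the form "if premise then outcome." It is illustrative only: it shows the unmodified measure, not the Modified J-measure used by DaMI, and the function name and argument names are assumptions introduced here. The prior probability of the outcome is assumed to lie strictly between 0 and 1.

```python
import math

def j_measure(p_premise, p_outcome, p_outcome_given_premise):
    """Standard J-measure (in bits) for the rule 'if premise then outcome'.

    p_premise               -- probability that the premise holds
    p_outcome               -- prior probability of the outcome (0 < p < 1)
    p_outcome_given_premise -- probability of the outcome when the premise holds
    """
    def term(posterior, prior):
        # posterior * log2(posterior / prior), taken as 0 when posterior is 0.
        return 0.0 if posterior == 0 else posterior * math.log2(posterior / prior)

    # Relative entropy between the posterior and prior outcome distributions.
    divergence = (term(p_outcome_given_premise, p_outcome)
                  + term(1 - p_outcome_given_premise, 1 - p_outcome))
    # Weight by how often the premise applies.
    return p_premise * divergence
```

When the premise tells us nothing about the outcome (posterior equals prior), the measure is zero; it grows as the premise shifts the outcome probability and as the premise applies to more records.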
Once the problem and source database were structured appropriately, a suitable research
tool was needed. It was clear that using a “brute force” approach to examine the CCEP database,
even using computer simulation, was impractical because of the tremendous size of the search
space. A genetic algorithm was chosen because of the innate ability of genetic algorithms to
inductively adapt to the researcher’s goals and to intelligently analyze a search space, bypassing
hypotheses which show little chance of future success. Our concept enhanced the conventional
genetic algorithm approach by dividing the process into two modules: a genetic operator, which
handles selection and recombination of hypotheses at the field level only, and a statistical
package, which analyzes every possible combination of hypothesis fields passed from the genetic
operator and returns an integrated fitness measure for the entire hypothesis. Additionally, our
tool examines multiple independent and dependent (LHS and RHS) fields because CCEP could
not determine which field or combination of fields would identify a target outcome.

Finally, the problem of validation and search space coverage must be addressed. A great
deal of literature supports the idea that a genetic algorithm can deduce hypotheses that apply to a
database. However, it is critical that these results both be validated against independent data and
be shown to accurately address the research question, instead of just describing the
actual data set analyzed. Several tools were developed to validate the results, among them an
independent validation algorithm which re-tests resulting hypotheses against the
subject database and a cross-validation procedure that tests hypotheses generated from one
randomly sampled subset of the database against another randomly sampled subset.
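
The cross-validation idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DaMI validation code: the function names, the even two-way split, the threshold test, and the `fitness(hypothesis, records)` callback are all hypothetical, and filtering a candidate list stands in for the genetic search on the first subset.

```python
import random

def split_records(records, seed=0):
    """Randomly split records into two disjoint subsets of (nearly) equal size."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

def cross_validate(candidates, fitness, records, threshold, seed=0):
    """Return candidates that clear the threshold on both random subsets.

    fitness(hypothesis, records) is an assumed callback that scores a
    hypothesis against a set of records.
    """
    discovery, holdout = split_records(records, seed)
    # Stand-in for the search step: keep what looks interesting on one subset.
    found = [h for h in candidates if fitness(h, discovery) >= threshold]
    # Validation step: re-test those hypotheses on the untouched subset.
    return [h for h in found if fitness(h, holdout) >= threshold]
```

A hypothesis that scores well only on the subset where it was found is likely an artifact of that sample; one that survives on the second subset is a better candidate for clinical follow-up.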

The thesis is divided into seven chapters:

• Chapter I : Introduction

• Chapter II : Description of the CCEP research, the database itself, and problem
structure challenges

• Chapter III : Overall solution concept and high-level research approach

• Chapter IV : Description of the DaMI algorithm, its design, implementation, and
validation processes

• Chapter V : Technical description of the DaMI algorithm operators, innovations, and
procedures

• Chapter VI : Summary of results

• Chapter VII : Conclusion and recommendations for future research
II. COMPREHENSIVE CLINICAL EVALUATION PROGRAM



A. BACKGROUND AND HISTORY OF CCEP 

The Department of Defense (DoD) began to examine the health consequences of Persian
Gulf War (PGW) service while U.S. troops were still deployed to the Persian Gulf Region. The
initial focus of medical researchers was on the health risks associated with smoke from Kuwaiti
oil fires. As early as 1992, groups of PGW veterans began presenting with health complaints
which they attributed to PGW service. Many of these veterans reported nonspecific symptoms or
those not directly attributable to a specific disease or syndrome (group of commonly occurring
symptoms/conditions). This sparked the first of many tests (first by the Army in 1992 and
subsequently by other services) to attempt to discover if these non-specific symptoms could be
linked with any “clusters” of PGW veterans. The theory of this approach is that a new syndrome
will present as a “cluster” or group of individuals sharing some common trait (demographics,
location, action, exposures, etc.) who also share a similar group of symptoms. (CCEP, 1996, pp.
6-7) This is the first step to identifying a new syndrome. Once a syndrome is defined, then
medical researchers begin efforts to find the cause of the syndrome. If a solid cause-effect
relationship is established and documented between an entity (virus, bacteria, etc.) or health risk
factor(s) (like smoking or cholesterol), then the syndrome may be considered a full-fledged
disease.

In response to the health concerns of PGW Veterans, both DoD and Veterans Affairs
(VA) established similar comprehensive clinical evaluation programs. The data for this research
comes from the DoD CCEP. The CCEP program was officially enfranchised by the Assistant
Secretary of Defense (Health Affairs) as part of a three-point plan, announced on 11 May 1994.
This plan included:

• The development of an aggressive, comprehensive, clinical diagnostic program to 
offer intensive examinations to veterans who do not have clearly defined diagnoses. 
• An initial independent review of DoD clinical and research efforts concerning the
Persian Gulf War by Dr. Harrison C. Spencer, Dean of the Tulane School of Public 
Health and Tropical Medicine, New Orleans, Louisiana, and 

• The creation of a forum of national medical and public health experts to review, 
comment, and advise DoD concerning the results of the clinical evaluation program. 
(Joseph, 1994) 

CCEP continues to offer in-depth medical examinations, through the Military Health Services 
System (MHSS) to any PGW veteran having health concerns. Over 27,000 PGW veterans and 
their dependents have initiated medical examinations with CCEP, of which over 19,000 have 
been completed by the participants. The data collected from these 19,000 participants has been
recorded in a single database (the CCEP database), which is the source database for this research. 
(CCEP, 1996, pp. 7 - 12) 

Since the inception of CCEP, numerous medical research programs have been conducted
by DoD and non-DoD health organizations (including the Defense Science Board, National
Institute of Health, Naval Health Research Center in San Diego, University of California,
Department of Health and Human Services, and National Academy of Sciences). Although
several research efforts are still ongoing, the possibility of an unknown syndrome or disease
affecting PGW veterans and their families has been exhaustively examined. DoD has committed
to continue research on this issue but stated:

To date, there is no clinical evidence for a previously unknown, serious illness 
or ‘syndrome’ among Persian Gulf veterans participating in the CCEP. A
unique illness or syndrome among Persian Gulf veterans evaluated through the 
CCEP, capable of causing serious impairment in a high proportion of veterans 
at risk, would probably be detectable in the population of 18,598 patients. 

However, an unknown illness or a syndrome that was mild or effected only a 
small proportion of veterans at risk might not be detectable in a case series, no 
matter how large. (CCEP, 1996, p. 4) 

It is this viewpoint that has catalyzed the need for an intelligent, automated search program to
analyze the CCEP database. Clearly, conventional research (user-controlled query and clinical
evaluation) has reached the limit of available resources, and yet there is still a possibility that a
syndrome has remained undetected. Proper implementation of a genetic algorithm can expand
the horizon of research by sifting through hypotheses not yet considered, and can do so using
small amounts of time, funds, and human effort.



B. CCEP RESEARCH VISION 

The core of CCEP research is based on classic epidemiological technique. The CCEP
database has been constructed to capture as wide a range of data about PGW participants as is
practical. Data collection practices have been standardized and unbiased--any participant with a
concern undergoes the same health screening and examination process. The basic premise of
analysis is that a new syndrome will present as “prominent and consistent physical and
laboratory findings” like Legionnaire’s disease or toxic shock syndrome or consistent
“non-specific symptomatology” as with chronic fatigue syndrome and fibromyalgia.

In any case, CCEP research efforts focus on slicing the database in many different
directions, whether by demographic information, symptoms, diagnoses, or reported exposure
categories. Percentages of PGW participants in each slice or “cluster” (which is a group of
participants with the same characteristics within a given research slice) are compared to the
percentage expected within a similar population not participating in the PGW. In many cases
(especially when the database is sliced by reported exposures), no comparable group is available,
so these percentages are compared against actual percentages or distributions among all 697,000
PGW personnel (as opposed to just those participating in CCEP). The point of the analysis is to
isolate any characteristic which appears to make a CCEP participant more likely to have
approached CCEP with a medical condition.
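
The comparison described above, a cluster's observed percentage against a reference percentage, can be sketched in a few lines. This is an illustrative sketch only: the function names, the record layout, and the field name `fatigue` are assumptions made for the example and are not taken from the CCEP schema or data.

```python
def cluster_rate(records, predicate):
    """Fraction of records that fall in the cluster defined by predicate."""
    records = list(records)
    if not records:
        raise ValueError("records must not be empty")
    return sum(1 for r in records if predicate(r)) / len(records)

def rate_ratio(records, predicate, reference_rate):
    """Observed cluster rate divided by the rate expected in a reference population."""
    if reference_rate <= 0:
        raise ValueError("reference_rate must be positive")
    return cluster_rate(records, predicate) / reference_rate

# Hypothetical records; values are made up purely to exercise the functions.
sample = [{"fatigue": True}, {"fatigue": False},
          {"fatigue": True}, {"fatigue": False}]
ratio = rate_ratio(sample, lambda r: r["fatigue"], reference_rate=0.25)
```

A ratio well above 1 flags a slice in which the characteristic is over-represented relative to the reference group; whether the difference is statistically meaningful still requires a significance test on the underlying counts.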

If some specific combination of demographics, personal habits (smoking/non-smoking),
and reported exposure is associated with specific symptoms and diagnoses within the group of
CCEP participants, then medical research is developed to clinically test the relationship of these
factors to personal health. It should be apparent that this approach is extremely resource
intensive. Analysis dimensions are limited to the imagination of individual researchers
developing the slices and the physical ability of medical researchers to examine the hypothesis.

If the quality of “statistical interest” could be mathematically modeled by an automated research 
tool, then the dimensions of analysis could be expanded to the limits of computer (as opposed to 
human) resources. The genetic algorithm (DaMI) is a research tool designed specifically to
relieve humans from the drudgery of human-controlled analysis so that they may focus efforts on 
clinical testing which machines cannot do. 



C. DATABASE DESCRIPTION 

The CCEP database is a “flat file” or single table with 177 attributes. It was created in
standard dBase© format and was actually received and manipulated using the Visual FoxPro®
Database Management System (DBMS). The database was not designed with automated
analysis or medical research (for that matter) in mind. Therefore, a great deal of manual file
manipulation was required before automated analysis was possible. By “manual” we mean the
issuance of single SQL® commands to reformat individual database schema and field values. At
no time was the actual data adjusted, but in many cases the representation schema was changed
to enhance automated processing. Appendix A contains the CCEP data dictionary, a
commentary on modifications/usability of each field, and a synopsis of the CCEP data collection
process. The actual database used for research contains 17,033 records for active duty CCEP
participants. Dependent records were removed prior to analysis at the request of the CCEP
program manager.

A large number of attributes containing administrative and/or privacy act data were
removed from the database and other attributes were added to enhance the schema, as discussed
above. (For a more complete description of schema modifications, see Section II.D.2.) In all, 140
attributes were present in the research database. Not all were examined at once (see Section
VI.A), but in any case the database was relatively large by medical or occupational health
research standards. The remaining attributes fall into four major categories:

• Demographic. Physical attributes of each participant (e.g. race, gender, age, home
state, service component, Unit Identification Code [UIC])

• Reported Exposures. Reported exposures to potentially hazardous environmental 
conditions by participants (e.g. botulism vaccine, oil smoke, uranium, passive 
smoke, local water, SCUD attack) 
• Reported Standard Symptoms. Standard symptoms elicited by physicians during
CCEP medical examinations (e.g. difficulty breathing, fatigue, headaches)

• Diagnoses. Each participant completing the entire CCEP medical examination 
process was assigned a primary and up to six secondary diagnoses. Diagnoses 
followed the standard numeric ICD coding system (e.g. V65.5 - Healthy Exam, 
307.81 - Chronic Muscle Tension Headaches, 780.71 - Fatigue) 

As will be seen in later sections, most analysis was conducted on associations between these
major attribute categories.



D. WHY DOES A GENETIC ALGORITHM WORK FOR CCEP 
ANALYSIS? 

1. Theory 

The theory of genetic algorithms was invented by John Holland in the early 1970’s.
Holland’s purpose was to create a search method based on the process of natural selection
observed in nature. He likened the attributes making up a hypothesis in a search problem to
chromosomes which “encode” a living being. He proposed that by creating mathematical
representations of genetic reproduction and applying natural selection, scored by a fitness
function, to those representations, he could create an adaptive search engine. Automation of this
process has proven to be an excellent task for computer systems. Although a great deal of
evolution is not understood, several general features are agreed upon: (Davis, 1991, pp. 2-3)

• Evolution is a process that operates on chromosomes rather than on the living beings 
they encode. 

• Natural selection is the link between chromosomes and the performance of their 
decoded structures. Processes of natural selection cause those chromosomes that 
encode successful structures to reproduce more often than those that do not.



• The process of reproduction is the point at which evolution takes place. Mutations
may cause the chromosomes of biological children to be different from those of their
biological parents, and recombination processes may create quite different
chromosomes in the children by combining material from the chromosomes of two parents.

• Biological evolution has no memory. Whatever it knows about producing 
individuals that will function well in their environment is contained in the gene pool--the
set of chromosomes carried by the current individuals--and in the structure of the
chromosome decoders. 

If one is to follow the theory of natural selection, then it could be inferred that attributes used to 
make hypotheses are the operators of evolution. The process of hypothesis evolution revolves 
around the combination of those constituent attributes of successful hypotheses and their 
resulting recombinations. Furthermore, these recombinations are directed blindly and guided 
only by the principle that attributes belonging to hypotheses of higher fitness measure are 
recombined more frequently than attributes belonging to hypotheses possessing lower fitness
measure. 

Holland went on to create three genetic operators which could mathematically recombine
the modeling chromosomes of coded hypotheses to mimic genetic recombination. Hypotheses
from the gene pool of the current generation are “selected” with a bias towards hypotheses with
higher fitness measures, and then operated on by one of these three genetic operators:

• Reproduction. Asexual reproduction of single parent rule to single offspring rule
without modification

• Crossover. Sexual reproduction involving the exchange of chromosomes between
two parents producing two different child rules.

• Mutation. Asexual reproduction of single parent rule with random modifications
resulting in a different child rule.
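
The three operators and the fitness-biased selection step can be sketched generically as follows. This is a textbook-style illustration, not the DaMI implementation: encoding a rule as a Python list of genes, the use of roulette-wheel selection and single-point crossover, and all function and parameter names are assumptions made for the example.

```python
import random

def select(population, fitness, rng):
    """Pick one parent with probability proportional to its fitness.

    Assumes fitness values are non-negative and not all zero.
    """
    weights = [fitness(rule) for rule in population]
    return rng.choices(population, weights=weights, k=1)[0]

def reproduce(parent):
    """Reproduction: copy a single parent rule unchanged."""
    return list(parent)

def crossover(parent_a, parent_b, rng):
    """Crossover: swap tails at a random cut point, producing two children.

    Assumes both parents have the same length, which is at least 2.
    """
    point = rng.randrange(1, len(parent_a))
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(parent, alleles, rng, rate=0.1):
    """Mutation: copy a parent, replacing each gene at random with the given rate."""
    return [rng.choice(alleles) if rng.random() < rate else gene
            for gene in parent]
```

Because selection is weighted by fitness, genes from high-scoring rules are copied and recombined more often across generations, which is the bias the text describes.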

Using the “two-armed and k-armed bandit problems” (see Holland, 1975 for complete proof),
Holland went on to prove that, lacking prior knowledge of the expected value of two or multiple
choices, allocating slightly more than exponentially increasing trials to choices with the highest
past success is the optimal means for choosing between options. The results of this theory and
their relation to genetic operators are summed up well by Goldberg:

In other words, to allocate trials optimally (in a sense of minimal expected loss), 
we should give slightly more than exponentially increasing trials to the observed 
best arm...Another method that comes even closer to the ideal trial allocation is 
the three-operator genetic algorithm discussed earlier. The schema theorem 
guarantees giving at least an exponentially increasing number of trials to the 
observed best building blocks. In this way the genetic algorithm is a realizable yet
near optimal procedure (Holland, 1973a, 1975) for searching among alternative
solutions. (Goldberg, 1989)

It is important to reiterate that genetic algorithms gain their speed, not by analyzing an entire
search space, but by deciding which attributes (chromosomes) hold the least probability of
producing interesting hypotheses and not testing hypotheses using those attributes. The process
is not deterministic, for it relies on probability, and different results will be derived each
time the algorithm is run. This fact will be discussed further in the discussion of results
validation.

Now let’s bring this theory closer to the current research question. A hypothesis
concerning the CCEP database may be “encoded” into a string representing its constituent
attributes. If one is to hold with Holland’s theory, then the attributes (in this case demographic,
exposure, symptom, or diagnosis) which make up the hypotheses (in a group of hypotheses)
having the highest fitness measure should be recombined in an exponentially increasing number
of fashions. Similarly, the attributes from unsuccessful hypotheses should be recombined
exponentially less often. Genetic operators, used in the DaMI genetic algorithm, prove to be the
optimal way of accomplishing this selection. Finally, if this process is followed, then the
extremely large search space of correlations within the CCEP database will be searched most
efficiently using a genetic algorithm. It is on this theoretical basis that we chose a genetic
algorithm to analyze the CCEP database.
2. Advantages and Disadvantages of the Genetic Algorithm Method



There is a great deal of theoretical literature on the advantages and disadvantages of
using genetic algorithms. It is the intent of this section to relate practical lessons learned from
our specific research using DaMI on the CCEP database. From the point of view of this research,
a genetic algorithm was particularly useful because of its ability to process tremendous amounts
of data and its lack of need for human interaction. It has already been shown that the CCEP problem
search space is too large to analyze by conventional means, even with a computer. The problem
cannot be structured strongly enough to limit the possibilities to realistic numbers, so technology
is being relied upon to perform the discrimination. Medical research assets are a scarce resource,
so employing medical experts only at the fitness function creation and final analysis stages
produces efficient and effective results. Should preliminary implementation of genetic
algorithms prove informative in this area of medical research, many other similar research
questions may benefit from this technology.

There are several disadvantages to using genetic algorithms, several of which have 
already been alluded to. First, as can be seen from Section II.D, a great deal of effort must be 
committed to database structure and normalization before processing. Since the system relies on 
computer evaluation of data, the data structure and coding scheme must be uniform and 
conducive to information extraction. Non-descriptive representations and textual data collection 
will severely curtail system performance. The strong coding and standardization of the CCEP 
database was one of the aspects that made it so attractive for this type of research. Second, a 
genetic algorithm is useless without a single, unambiguous representation of what is interesting 
to the operator. This was a key challenge to this research. There are many measures which may 
indicate the “interestingness” of a particular hypothesis, but the synthesis of a single aggregate 
measure which satisfies all components of epidemiological interest has been extremely difficult 
(several different fitness functions may be required). Finally, a difficult paradox arises when 
attempting to prove that a genetic algorithm has completely searched a large space. A genetic 
algorithm achieves its speed advantage by selective analysis, meaning it selectively eliminates 
search options with, apparently, little chance of yielding interesting results. The only way to 
actually prove that an interesting hypothesis was not missed is to test every 
hypothesis, but we turned to the genetic algorithm because the resources necessary to search the 
entire space were not available. To address this problem, the genetic algorithm is run several 
times. If the outcomes produced by several independent runs have a high intersection 
(particularly among hypotheses of high fitness), then there is strong evidence that the space has 
been searched adequately. A more detailed discussion of this challenge is included in Chapter V. 
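
The run-to-run comparison described above can be sketched as follows. This is only an 
illustration under stated assumptions: each run is taken to be a mapping from a hypothesis label 
to its fitness, the overlap is measured as the fraction of the top-n hypotheses common to every 
run, and the labels and fitness values are made up. The measure actually used for validation is 
the subject of Chapter V and may differ. 

```python
def top_hypotheses(run_results, n):
    """Return the set of the n fittest hypotheses from one run.

    run_results maps a hypothesis label to its fitness value.
    """
    ranked = sorted(run_results, key=run_results.get, reverse=True)
    return set(ranked[:n])


def overlap_fraction(runs, n):
    """Fraction of top-n hypotheses that appear in every run (1.0 means identical top sets)."""
    top_sets = [top_hypotheses(run, n) for run in runs]
    common = set.intersection(*top_sets)
    return len(common) / n


# Made-up results from three independent runs (labels and fitness values are illustrative).
runs = [
    {"H1": 9.1, "H2": 8.7, "H3": 5.0, "H4": 1.2},
    {"H2": 9.0, "H1": 8.8, "H5": 4.1, "H3": 3.9},
    {"H1": 9.3, "H3": 6.2, "H2": 6.0, "H6": 0.5},
]
overlap_top_2 = overlap_fraction(runs, 2)
overlap_top_3 = overlap_fraction(runs, 3)
```

A fraction close to 1 among the fittest hypotheses would support the claim that independent runs 
are converging on the same region of the search space. 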

To sum up, this research has found that genetic algorithms do search a very large space 
of alternatives very quickly and efficiently. Successive generations of hypotheses quickly 
improve in quality as measured by the fitness function, and therefore the algorithm does adjust its 
search to the operator’s goals. Strong database standardization and coding are a must before any 
processing is attempted. A genetic algorithm has proven successful in this research, provided that 
a fitness function can be created which accurately defines “what is interesting” to the researchers. 



E. KEY CHALLENGES TO CCEP ANALYSIS BY A GENETIC 
ALGORITHM 



1. Problem Structure 

The single most challenging aspect of this research is that “Persian Gulf Syndrome,” as it 
is referred to by the media, PGW veterans, and some researchers, is not yet really a defined 
syndrome at all. A syndrome must be defined by a unique series of symptoms and/or ailments 
which are shared by a specific group of individuals. Although many PGW veterans report a wide 
array of non-specific medical ailments associated with PGW service, no defined set of 
symptomatology has been instantiated as a candidate syndrome. 

CCEP clinicians have identified a wide range of specific diagnoses (i.e. 
migraine headache, depression, asthma, arthritis, hypertension). However, few 
if any of the conditions diagnosed to date could be considered specific for any of 
the many different exposures implicated as potential causes of Persian Gulf 
illnesses. Thus as a case series, the CCEP has identified a wide spectrum of 
different clinical conditions rather than any singular homogeneous diagnostic 
entity (CCEP, 1996, p. 79) 
While the medical implications of this statement are serious, the impact of this situation on 
research is tremendous. Basically, CCEP medical researchers cannot provide us with a 
description of a target syndrome for research, or for that matter tell us whether there are one, 
many, or any syndrome(s) at all. Without target syndrome characteristics, a researcher is unable 
to identify which field or combinations of fields within the database indicate a desired outcome 
(a syndrome of interest). In truth, researchers do not know if the data necessary to identify a 
syndrome, should one exist, are contained in the database at all. Therefore, we have been 
compelled to develop a tool which can examine “interesting” associations between any number 
of causative and outcome attributes without specificity as to the limits of either the causative or 
outcome space. This is both a curse and a blessing; the lack of specifics makes the problem 
considerably more challenging but also stimulates interest in our type of tool. 

What can be reasonably asked about the problem is the following: 

• Is there a syndrome? Is there a subset a (of A) of ailments such that the occurrence rate 
of a in PGW participants (G) is higher than the rate in a reference population (R)? 
[#a(G) denotes the number of occurrences of ailment subset a within the set of participants 
G, and #(G) the number of participants in G.] 

    \frac{\#a(G)}{\#(G)} > \frac{\#a(R)}{\#(R)}

• What caused the syndrome? Is there a subset x (of X) of exposures and/or 
demographics experienced by or attributed to participants in the PGW such that, for 
ailments a for which the prior relation is true, exposures/demographics x account 
for a significant part of the difference in occurrence rates of a in groups G and R? 

    P(a \mid x, G) = \frac{\#a(G)}{\#x(G)} \gg \frac{\#a(R)}{\#x(R)} = P(a \mid x, R)

The lack of precise target syndrome definition encourages the development of multiple 
research strategies. As mentioned before, the directed query technique used by CCEP (CCEP, 
1996, pp. 17 - 49) has sliced the database from numerous different perspectives. What is needed 
is a search tool which can examine multiple combinations of independent (LHS) and dependent 
(RHS) variables and all possible values for each variable simultaneously. This adds an extra 
dimension to the analysis. Conventional data mining tools typically allow the user to specify a 
range of possible LHS variables for search and a single RHS variable. Multiple RHS fields may 
still be handled under this doctrine by creating a pseudo field which contains a different value for 
each unique combination of values in the RHS fields to be examined. However, if the RHS 
fields for analysis are large in number or cannot be specifically identified, the pseudo field 
coding becomes impractically large. What is needed instead is a data mining tool which can 
apply selective induction operators to a range of possible attributes (not just individual attribute 
and value instances) on the LHS and RHS simultaneously. 

This methodology is plausible and in fact was done by DaMI in this research, but it is 
prudent to note that this strategy will still produce an extremely large search space. For example, 
the first analysis done by DaMI examines the associations between 15 standard symptoms (LHS) 
and 21 possible diagnoses (RHS). All attributes are Boolean and are not limited in the number of 
simultaneous combinations (all symptoms and diagnoses could be simultaneously present or 
“true”). Therefore the possible search space is 2^36, or 6.8 x 10^10, possible hypotheses. It is for 
this specific reason that we chose to use a genetic algorithm, with its ability to discriminately 
analyze tremendous search spaces. A test was conducted in which this particular problem was 
analyzed using simple “brute force” (test every possible combination indiscriminately), using a 
486DX/66 MHz personal computer. The personal computer was able to test about 600,000 
combinations per day. At this rate, this one complete analysis would take 114,992 days (315 
years). Even if a platform were chosen that was 100 times faster than our test personal computer, 
the analysis duration would be an unacceptable 3.15 years. 
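
The arithmetic behind this estimate can be reproduced with a few lines. The sketch below assumes 
the rounded rate of 600,000 tests per day quoted above; with that rounded figure the result is 
roughly 114,500 days, slightly below the 114,992 days given in the text, which presumably reflects 
the unrounded measured rate. The constant names are ours. 

```python
N_SYMPTOMS = 15          # Boolean LHS attributes
N_DIAGNOSES = 21         # Boolean RHS attributes
TESTS_PER_DAY = 600_000  # approximate rate reported for the 486DX/66 personal computer

# Every attribute is either present or absent, so the space doubles with each attribute.
search_space = 2 ** (N_SYMPTOMS + N_DIAGNOSES)
days = search_space / TESTS_PER_DAY
years = days / 365

print(f"{search_space:,} possible hypotheses")
print(f"{days:,.0f} days, or about {years:.0f} years")
print(f"on a platform 100 times faster: about {years / 100:.1f} years")
```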
2. Database Content and Structure 



Several problems were encountered during the course of this research with the CCEP 
database content and structure. These problems fall into two major categories: data 
representation anomalies, which make it difficult for an algorithm to extract meaningful 
information from the data, and data collection anomalies, which introduce bias into the data being 
analyzed. Examples of data representation anomalies include irrelevant data and non-normalized 
data. These problems must be corrected before useful analysis can be conducted; they usually 
require modification of the database itself. In the case of CCEP, data collection anomalies 
include data that were self-reported by participants, self-referral of PGW veterans to the CCEP 
program, and lack of an established control group. Collection anomalies do not interfere with 
analysis itself, but they must be acknowledged or accounted for when examining results. 

Seventy-seven fields in the CCEP database are simply unusable. Many fields contain 
sensitive unclassified data on the participants (names, social security numbers, addresses, etc.) 
which are not helpful for medical research and are subject to the Privacy Act of 1974. Those 
fields were deleted at the outset. Another, larger group of fields is used by CCEP for 
administrative processing and is similarly not helpful to research. Finally, there were some fields 
that have been collected as non-standardized text. The most serious occurrence of this is the 
“chief complaint,” or in other words the reason that the participant approached CCEP for an 
examination. No standardization was enforced in this free-text field, so it is nearly impossible 
for a computer to determine similarity between tuples, short of creating a complete index of chief 
complaint texts and some standard category indicator. This is fortunately not the case with 
diagnoses, which use the standard numeric ICD coding system. Participant complaint 
information was captured in the form of fifteen standard symptoms, but a coded chief complaint 
would prove most helpful. 

A key shortcoming of the database, reported at the outset by CCEP, is the large amount 
of data which are self-reported by participants. Self-reported data are those determined directly 
from participants’ responses during their medical examinations (as opposed to 
clinical test results, review of documentation, or impartial third-party observation). Self-reported 
data are analogous to a survey, which is in and of itself not a database flaw. However, in the 
context of CCEP, all exposure and standard symptom data are self-reported. This reduces the 
direct applicability of aggregate participant responses because perceived exposure may be 
distinctly different from actual exposure. This is most easily demonstrated by an example we 
call “the Botulism Illusion.” Within the CCEP database, 26.4% (4,500) of the active-duty 
participants report receiving the botulism vaccine. Yet it is known from medical records that 
only 8,800, or 1.26%, of the 697,000 PGW veterans were given this vaccine. This high 
percentage (26.4% of participants) would appear to suggest a possible relationship between the 
botulism vaccine and PGW medical ailments, until it is pointed out that 21.9% of the CCEP 
participants who were examined and deemed “healthy” (primary diagnosis of V65.5) also 
reported receiving the botulism vaccine. (See Figure #1) Problems concerning reported data 
may be compensated for by collecting and examining a “control group” of participants who do 
not have significant medical conditions; however, reported data should always be interpreted 
with some degree of caution. 
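
The comparison behind the Botulism Illusion can be laid out as a short calculation. The sketch 
below uses only the figures quoted above; the variable names and the choice of ratios are ours 
and are meant as an illustration, not as part of the CCEP analysis. 

```python
PGW_VETERANS = 697_000
VACCINATED_PER_RECORDS = 8_800

actual_rate = VACCINATED_PER_RECORDS / PGW_VETERANS  # rate documented in medical records
reported_rate_all = 0.264      # all active-duty CCEP participants
reported_rate_healthy = 0.219  # participants with a primary diagnosis of V65.5

# How far the self-reported rates exceed the documented vaccination rate.
over_report_all = reported_rate_all / actual_rate
over_report_healthy = reported_rate_healthy / actual_rate

# Ratio of the two self-reported rates; a value near 1 means the "healthy" group
# reports the vaccine almost as often as participants overall.
reported_ratio = reported_rate_all / reported_rate_healthy

print(f"documented rate: {actual_rate:.2%}")
print(f"over-reporting, all participants: {over_report_all:.1f}x")
print(f"over-reporting, V65.5 participants: {over_report_healthy:.1f}x")
print(f"all vs. V65.5 reported rate: {reported_ratio:.2f}")
```

Both groups report the vaccine at many times the documented rate, while the two reported rates 
differ from each other only modestly, which is why the raw 26.4% figure is misleading on its own. 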



[Panels: Reported by CCEP Participants; Reported by “V65.5” Participants] 

Figure 1. The Botulism Illusion 
Another obstacle to a meaningful analysis of the CCEP database is the self-referral of 
participants (participants made a conscious decision to start the CCEP examination process). 
As described in Appendix A, any individual who was eligible for medical care under the MHSS 
system in 1994 and had a health concern related to PGW service (whether directly or indirectly) 
could request a full medical evaluation under the CCEP program. This encouraged a wide range 
of participants, but the self-referral of patients may invalidate the CCEP database as a statistical 
representation of PGW veterans as a whole. Had the participants in CCEP been selected 
randomly, then their aggregate response and demographic data could have been considered 
statistically representative. In this case, the sheer act of self-referral introduces some level of bias 
which, if it can be identified, should be explained to the degree possible. One possible solution 
is to randomly select a suitably large group of PGW veterans, regardless of health concerns, and 
provide them with the same medical evaluation as the other, self-referred, participants. In other 
words, create a control group. A control group will help identify bias from both self-reporting 
and self-referring. Unfortunately, this has not been adopted as part of the CCEP program. 
Suggestions have been made to create a control group after the fact, but a strong argument can be 
made that the passage of time since 1994 will introduce similar bias into the responses of a 
present-day control group. 

The reader should not infer that the CCEP database is a poor source; it has many strong 
points. After removal of unusable fields and reformatting of other fields for enhanced analysis, 
140 “good” fields have remained for analysis. One of the most positive aspects of the database is 
the standardization of CCEP data collection. From the outset, CCEP used the same database 
structure, examination process, and coding scheme for all medical examinations. There are some 
exceptions, such as the case of chief complaint (mentioned above), but overall the data content is 
strongly coded and standardized. Any reader who has dealt with data analysis at all should 
appreciate the importance of a uniform database structure and coding system to computer 
analysis. Something as simple as representing an affirmative response as “Y” or “Yes” or “yes” 
can make computer-based query far more difficult. Of particular significance was the uniform 
usage of numeric ICD codes to represent outcome diagnoses. 
3. Database Normalization 



The uniform coding scheme used in the CCEP database and limited need for scalar 
(continuous numerical) data sharply reduced the need for normalization (when used in a data 
mining context, “normalization” means structuring a database for effective computer analysis). 
The coding scheme used in the CCEP database is quite strong, so only a few modifications were 
made to normalize the database. Three significant modifications were made to the schema for 
analysis. Diagnoses were converted from single fields to multiple Boolean fields to facilitate 
analysis of diagnosis combinations. Standard symptoms were changed from durations to simple 
occurrence to avoid the ambiguity of comparing duration categories. Finally, an aggregate 
reproductive disorder field was created to relate reported reproductive disorders of any type. 



a. Boolean representation of diagnoses 

The CCEP database captures outcome diagnoses assigned by the examining 
physician as a primary diagnosis and six secondary diagnoses. CCEP researchers assign a 
somewhat higher emphasis to the primary diagnosis, and place little weight on the ordering of 
secondary diagnoses. Therefore, a medical researcher would not differentiate between a 
diagnosis of fatigue appearing second or, say, fourth on a list of diagnoses attributed to a 
participant. A computer, on the other hand, could consider these distinctly different occurrences. 
Since combinations are central to this research, it is much easier to represent and analyze a 
string of diagnosis fields with Boolean (yes or no) values than a string of up to seven 
unordered diagnoses. However, 1700 different diagnoses were assigned to the 19,000+ CCEP 
participants, so a pure Boolean representation would be extremely unwieldy. We decided to 
represent the twenty-one most frequently occurring diagnoses as Boolean fields in addition to 
the existing ICD representation. The number twenty-one was selected arbitrarily (it can be 
expanded in future research), but at least one of the selected diagnoses is included in 74.7% of 
participant outcomes. See Figure #2 below. 
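
A minimal sketch of this restructuring is given below. It assumes each participant record is an 
unordered list of diagnosis codes (primary plus secondary); the function names, the `dx_` field 
prefix, and the sample records are hypothetical and serve only to illustrate the conversion. 

```python
from collections import Counter


def top_diagnoses(records, n):
    """Return the n most frequent diagnosis codes across all participants."""
    counts = Counter(code for codes in records for code in set(codes))
    return [code for code, _ in counts.most_common(n)]


def to_boolean_fields(codes, selected):
    """Map one participant's unordered diagnosis list to one Boolean per selected code."""
    present = set(codes)
    return {f"dx_{code}": code in present for code in selected}


# Illustrative records only: each list holds a primary diagnosis plus any secondary diagnoses.
records = [
    ["780.7", "346.9"],
    ["346.9", "780.7", "311"],
    ["V65.5"],
    ["780.7"],
]
selected = top_diagnoses(records, 2)
rows = [to_boolean_fields(codes, selected) for codes in records]

# Share of participants with at least one of the selected diagnoses.
coverage = sum(any(row.values()) for row in rows) / len(rows)
```

The first two sample participants list the same diagnoses in a different order and receive 
identical Boolean rows, which is the order-insensitivity the restructuring is meant to provide. 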
Figure 2. Diagnosis Attribute Restructuring 



b. Standard Symptoms 



In the CCEP database, participants are asked to report suffering from fifteen 
standard symptoms (e.g. chest pain, difficulty breathing, headaches). The responses are 
collected as dates of onset and durations. The date and duration are subjective (and subject to 
error), and, like diagnoses, difficult for an automated search engine to compare. A higher 
confidence can be assigned to a response if it is represented as a Boolean (the participant will in 
most cases accurately report existence of the symptoms, while his/her ability to estimate an onset 
and duration is questionable). Therefore, fifteen additional fields are added to the CCEP database, 
one corresponding to each symptom and equal to “Y” if the participant reported the symptom at 
any time for any non-zero duration. 
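
The rule for the new symptom fields can be sketched as follows. The encoding of the original 
response (a numeric duration, with None for an unreported symptom), the use of “N” for the 
negative case, and the symptom names are assumptions made for this illustration; the text only 
specifies when the field is set to “Y.” 

```python
def symptom_flag(duration_months):
    """Return "Y" if the symptom was reported for any non-zero duration, else "N".

    duration_months is None when the participant did not report the symptom
    (an assumed encoding, not the CCEP coding scheme).
    """
    return "Y" if duration_months is not None and duration_months > 0 else "N"


# Hypothetical participant record: symptom name -> reported duration.
reported = {"chest_pain": 6, "difficulty_breathing": None, "headaches": 0}
flags = {name: symptom_flag(duration) for name, duration in reported.items()}
```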



c. Reproductive Disorders 

One of the high visibility aspects of the PGW is the possibility that a syndrome 
may be causing PGW participants to experience a higher rate of reproductive disorders 
(specifically birth defects). The CCEP database captures reproductive disorders (a participant 
may report a reproductive disorder actually experienced by a spouse or manifested in offspring) 
in five areas: 



• Infertility 

• Miscarriages 

• Still births 

• Infant deaths 

• Birth defects 

These five categories are further subdivided into disorders experienced prior to and after PGW 
service, making a total of 10 reproductive disorder fields. We cannot be certain that a syndrome, 
should it exist, would cause only one form of reproductive disorder. Therefore, two new fields 
were created to reflect any reproductive disorder experienced by the participant, either prior to or 
after the PGW conflict. In other words, if a participant reported infertility, a miscarriage, a still 
birth, an infant death, or a child with birth defects prior to PGW service, then the new field 
(PQ_prior) was set to “Y.” If none of these were experienced prior to PGW service, then 
PQ_prior was set to “N.” Similarly, if any of the five sub-categories were affirmatively answered 
after PGW service, then PQ_after was set to “Y.” This will allow the research to be more 
sensitive to associations between demographic, exposure, symptom, and diagnosis data and any 
combination of reproductive disorders. Naturally, any interesting associations developed 
concerning these two new fields will need to be re-categorized by medical researchers before a 
finding may be made. 
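
The aggregation rule for PQ_prior and PQ_after can be sketched as below. The per-category field 
names of the form category_period and the sample participant are hypothetical; only the two 
aggregate field names and the “Y”/“N” rule come from the description above. 

```python
CATEGORIES = ["infertility", "miscarriage", "still_birth", "infant_death", "birth_defect"]


def aggregate_flag(record, period):
    """Return "Y" if any of the five categories is "Y" for the period ("prior" or "after")."""
    answered_yes = any(record.get(f"{category}_{period}") == "Y" for category in CATEGORIES)
    return "Y" if answered_yes else "N"


# Hypothetical participant; missing fields are treated as not reported.
participant = {"miscarriage_prior": "N", "infertility_after": "Y", "birth_defect_after": "N"}
participant["PQ_prior"] = aggregate_flag(participant, "prior")
participant["PQ_after"] = aggregate_flag(participant, "after")
```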

After completion of normalization, 6 demographic, 32 reported exposure, 15 (Boolean) 
standard symptom, and 21 (Boolean) diagnosis fields are available for automated analysis. 
These 74 fields observe a uniform structure and coding scheme and are the foci of this research. 
Please consult Appendix A for a detailed list of analyzed fields. 
4. What is “Interesting?” 



In Section II.D.1, we asked the question, “What is a syndrome?” It is necessary at this 
point to revisit this question, but from an automated analysis perspective. A genetic algorithm 
depends (as do many other techniques) on the ability of the researcher to define in quantitative 
terms what is “interesting.” The problem in many forms of decision science is not whether a 
model performs accurately, but rather whether it improves the quality of a decision. In a genetic 
algorithm, selection of hypotheses to evaluate is proportionally related to a “fitness” value for 
each hypothesis, so it is critical that our “fitness function” accurately represents the interest of 
medical researchers. This characteristic is reflected in fundamental genetic theory: 



“Roughly, the fitness of a phenotype is the number of its offspring which survive 
to reproduce... This measure rests upon a universal, and familiar, feature of 
biological systems: Every individual (phenotype) exists as a member of a 
population of similar individuals, a population constantly in flux because of the 
reproduction and death of the individuals comprising it. The fitness of an 
individual is clearly related to its influence upon the future development of the 
population. When many offspring of a given individual survive to reproduce, 
then many members of the resulting population, the “next generation,” will 
carry the alleles of that individual.” (Holland, 1975, p. 12) 

This returns us to the fundamental question: “What is interesting to CCEP medical researchers 
and how will that interest be manifested in the database?” In Section II.D.1, we stated that we 
are not sure whether a syndrome exists, and, if it does exist, we are not certain that the data 
captured in the CCEP database are appropriate to identify it. However, if these two uncertainties 
are removed, the following assertions can be made: 



• If there are one or more syndrome(s) affecting PGW veterans, the data to identify 
them may already exist in the CCEP database but be hidden by the sheer volume of 
data. 

• In this case, a syndrome will manifest itself as a single or unique group of diagnoses 
or symptoms shared by a cluster of participants sharing some common exposure 
and/or demographic attribute(s). 
By plunging directly into a search for associative relationships between risk factors and 
outcomes, we bypass a fundamental step in classical epidemiological technique. Normally, 
epidemiologists will first define the outcome diagnoses and/or symptomatology which describe a 
prospective syndrome. Once the definition is made, research efforts are focused on 
associations with risk factors and other exposure sources. Unfortunately, the present research is 
left with a less than optimal situation. We suggest that a promising use for a genetic algorithm is 
to give clues to medical researchers that help them define a syndrome. 

In this research, we have accepted that conventional research methods alone may not be 
able to define and isolate a syndrome affecting PGW veterans. We are now led to re-examine the 
problem from different perspectives. Our research approach has been guided by the following 
ideas: 



• We are not trying to create an analysis that will isolate a single pre-defined Desert 
Storm Syndrome. Instead we are defining a profile that a syndrome might follow, 
should it exist. Our goal is to determine how a possible syndrome would be 
reflected in the data, as discriminately as possible, and then construct a fitness 
function which is appropriately high when this profile is met. 

• Our genetic algorithm does not find a Desert Storm Syndrome, but rather distills the 
billions of possible hypotheses into a set of hundreds. Not all hypotheses in the 
candidate set are syndromes, but if one or more syndromes do exist, they will be 
found in the candidate set. This smaller set of candidate hypotheses may realistically 
be examined more exhaustively by medical researchers and other conventional 
means. 

• By implementing the genetic algorithm as a precursor to medical research (and 
setting aside the idea that it must find “the answer”), we allow the genetic algorithm to 
significantly reduce the burden on the relatively scarce medical research assets at a 
relatively small cost to the organization. In more basic terms, the secret to operating 
genetic algorithms in an imperfect world is to allow them to do the first 80% of the 
analysis work with only 20% of the research cost. 






With the question of “interest” now bounded, a proper fitness function may now be 
pursued. If a true syndrome does exist, then it is “caused” by something. Therefore, the 
participants will share some finite set of exposure mediums, or in other words all participants 
with a syndrome will share some commonality in exposure. This must be qualified by noting 
that the CCEP database may or may not contain the demographic and exposure elements to 
identify that commonality of exposure. But as our research mindset states, we are only 
attempting to establish the profile of a syndrome if it exists, and if the data necessary to identify 
it are contained in the CCEP database. If the prior statement is true, then there will be a relatively 
strong association between a finite set of exposure/demographic attributes and a unique 
combination of outcome diagnoses. Likewise, there will be a strong association between a finite 
set of exposure/demographic attributes and a specific combination of standard symptoms. The 
intersection between diagnosis and symptom combinations with similar exposure associations 
will profile a candidate syndrome. See Figure #3 below. 
Now our question of “what is interesting?” can be answered. “Interesting” describes 
combinations of RHS attributes (dependent variables) which are highly dependent on 
combinations of LHS attributes (independent variables); in other words, the candidate dependent 
variables are truly determined by (not independent of) the candidate independent variables. The 
fitness function used must be such that hypotheses which demonstrate this property will be 
assigned a relatively high fitness value. There are numerous accepted functions in statistical 
literature that fit this requirement. Several of these are discussed in the next section. 



a. Conventional Epidemiological Measures 



A great deal of literature already exists, such as (Goldberg, 1989) and (Holland, 
1975), to support the idea that genetic algorithms are quite successful at adaptively improving the 
quality of tested rules to suit the provided fitness function. From the outset, our genetic 
algorithm demonstrated this quality. However, the greatest challenge has been to ensure that the 
search model adequately represents the research questions (i.e. the genetic algorithm is doing 
what it was told to do, but have we provided it with relevant, meaningful instructions?). As a 
starting point for development of the fitness measure for this research, we first turned to classical 
epidemiology literature. 

Classical epidemiology evaluates any test in terms of four variables (see Figure #4 
below) which describe how successfully a test predicts the actual presence (or lack) of a specified 
disease. This is much akin to our own research, which attempts to identify how successfully 
single or multiple exposure and/or risk factor attributes predict a combination of symptoms or 
clinical diagnoses. In epidemiology, these four variables {a, b, c, d} are computed using a 
two-by-two matrix of test results and actual disease presence. 
Test result   Disease present         Disease absent          Predictive value
Positive      a (True Positive)       b (False Positive)      PV(+) = a/(a+b)
Negative      c (False Negative)      d (True Negative)       PV(-) = d/(c+d)
              Sensitivity = a/(a+c)   Specificity = d/(b+d)

Figure 4. Classical Epidemiological Measures 



By mathematically manipulating these four variables, four “quality” values are obtained from the 
relationship between the subject test and subject disease. In each case, keep in mind that our 
research is applying the risk/exposure as a test for (or indicator of) a specific symptom and/or 
diagnosis profile. These quality values are (Fletcher, 1982, pp. 43 - 57): 



• Positive Predictive Value. Indicates the ability of a positive test result to accurately 
identify the presence of a disease in a patient. This term is similar to “confidence” used 
as a fitness measure in many data mining tools. We term this “forward confidence.” 

    PV(+) = \frac{a}{a + b}

• Negative Predictive Value. Indicates the ability of a negative test result to accurately 
determine the absence of a disease in a patient. Most data mining tools do not consider 
this measure, but recommend the analysis be run with swapped dependent and 
independent variables. This is not practical if multiple dependent variables are being 
analyzed. 

    PV(-) = \frac{d}{c + d}

• Sensitivity. The proportion of subjects with a disease who have a positive test for the 
disease. A sensitive test will rarely miss people with the disease. 

    sensitivity = \frac{a}{a + c}

• Specificity. The proportion of subjects without the disease who have a negative test. A 
specific test will rarely misclassify people without the disease as diseased. 

    specificity = \frac{d}{b + d}
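
The four quality values follow directly from the two-by-two counts, as the short sketch below 
shows. The function name and the sample counts are ours and purely illustrative; the sketch does 
not guard against empty rows or columns (zero denominators). 

```python
def epidemiological_measures(a, b, c, d):
    """Compute the four classical measures from a two-by-two table.

    a = true positives, b = false positives, c = false negatives, d = true negatives.
    """
    return {
        "pv_positive": a / (a + b),   # "forward confidence"
        "pv_negative": d / (c + d),
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
    }


# Made-up counts for illustration only.
measures = epidemiological_measures(a=90, b=10, c=30, d=70)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```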



b. Fitness Measure Paradoxes 

In our research, classical epidemiology measures are helpful in choosing a 
suitable fitness function, but no single aforementioned measure is sufficient, for several reasons. 
Rather, we desire an aggregate fitness measure which will increase in response to any classic 
measure of interest. Fundamentally, this research problem differs from clinical test evaluation in 
one respect. While a high number of either false positive (b) or false negative (c) tests is a 
counter-indication of a test’s quality, it is also of interest (in our case) if a risk/exposure 
combination is contraindicative of an outcome symptom/diagnosis set. In certain cases, a true 
positive may mean nothing because there are also many false positives. In other cases, a 
simultaneously high false positive and false negative count is quite informative. This is best 
described by an example (Figure #5), but basically, in the case of CCEP database analysis, we 
are most interested in the hypotheses having the highest and the lowest values of sensitivity and 
specificity. 
Consider the simplest hypothesis: one LHS field (L) and one RHS field (R). 

• If L and R are Boolean, there are four possible hypotheses to test. 

• We are looking for more than just a high prob(R = “yes” | L = “yes”). 

Hypothesis                       INTERESTING   NOT INTERESTING
IF L = “yes” THEN R = “yes”      90%           10%
IF L = “yes” THEN R = “no”       10%           90%
IF L = “no” THEN R = “no”        80%           80%
IF L = “no” THEN R = “yes”       20%           20%

As the number of fields and/or values per field increases, the problem expands exponentially. 

Figure 5. Attribute Value Relationships 

c. Alternative Fitness Measures 

Now that our concept of “interesting” has been framed from the epidemiological 
perspective, we can set about the task of selecting a single fitness measure which mathematically 
describes our concept of interest to the genetic algorithm. Again, there is some challenge in this 
because there are several different measures of interest to medical researchers (discussed in the 
previous section), yet the genetic algorithm requires a single aggregate fitness measure. The 
genetic algorithm could be run several times using different fitness measures, but this carries a 
high cost in both processing time and post-processing analysis effort. Likewise, we have seen 
from the preceding section that reliance on any single measure carries with it the possibility of 
statistical misinterpretation. Two paths were examined in this research to address this problem, 
although we note that there may be many other possible solutions. 

• Modified J-measure. Refer again to Figure #4 and the four test characteristics 
[PV(+), PV(-), sensitivity, and specificity]. Our first approach was to create a 
measure which was suitably large when any of these four measures were large and 
suitably low when none of the measures were relatively large; in effect, an aggregate 
fitness measure. It should be noticed from the foundation we have laid that if both a 
and d are relatively large when compared with b and c, the four test characteristics 
are all relatively large. This would demonstrate that the risk factors and/or exposures 
under investigation are highly successful in predicting the outcome symptoms and/or 
diagnoses under investigation. Tentatively we will select the following formula as 
our fitness measure: 

$$ \mathrm{mod\_j}(\mathit{fitness}) = \frac{a \times d}{b \times c} $$

It may also be noticed that this measure will effectively indicate if the outcome 
symptoms/diagnoses are successful at predicting the risk/exposures. We call this 
property “reverse confidence.” It is particularly helpful to examine the two sets of 
attributes with each assuming the role of dependent and independent variables 
simultaneously. Finally, recall that unlike the evaluation of clinical tests, CCEP 
analysts consider it interesting if both false positive and false negative values are 
simultaneously high (indicating a risk/exposure combination reduces the probability 
of a symptom/diagnosis combination). To account for this situation, our j-measure is 
modified as follows: 



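$$
\mathrm{mod\_j} =
\begin{cases}
\dfrac{a \times d}{b \times c}, & \text{if } \dfrac{a \times d}{b \times c} \ge 1 \\[2ex]
\dfrac{b \times c}{a \times d}, & \text{if } \dfrac{a \times d}{b \times c} < 1
\end{cases}
$$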

(Figure #6 gives an example of a modified j-measure calculation; note we use a 
natural log function to shape the fitness function for better genetic competition; this 
will be discussed in Chapter V): 








mod j-measure = 1 + ln[(a*d)/(b*c)] = 1 + ln[(11*7505)/(84*146)] = 2.91 

                                Fatigue 
                          “yes”        “no” 
Uranium     “yes”         a = 11       b = 84        PV(+) = 11/(11+84) = 11.6% 
Exposure    “no”          c = 146      d = 7505      PV(-) = 7505/(146+7505) = 98.1% 

                          Sensitivity                Specificity 
                          11/(11+146) = 7.0%         7505/(84+7505) = 98.9% 

Figure 6. Modified J-measure Calculations 



• Chi-square. Another approach to the question of fitness function may be derived 
strictly from statistics. Since our aim is to identify risk factors and/or exposures that 
are highly associated with symptom and/or diagnosis groups, we may use a 
statistical principle which measures the independence (not the same as the term 
“independent variable” used in knowledge discovery science to denote the LHS 
variables) of two groups of attributes. According to Walpole, et al., “The chi-square 
test procedure... can also be used to test the hypothesis of the independence of two 
variables of classification.” (Walpole, et al., 1988, pp. 343 - 346) The same 
“contingency table” used by epidemiologists may be constructed and used to 
compute expected levels of a, b, c, and d based on the joint probability function of 
the dependent and independent variables. (See Figure #7) Observed values are the 
original values of a, b, c, and d, and expected values are calculated using the 
following formula: 

$$ \text{Estimated Expected Value} = \frac{(\text{column total}) \times (\text{row total})}{\text{grand total}} $$

The chi-square is now calculated and summed for all cells in the matrix. (Chi-square 
may be used for any size matrix; in this case a two-by-two matrix was used for simplicity. 
Since a two-by-two matrix is used in the example, the formula below contains the Yates 
Correction, which is not necessary in larger matrices.) A higher chi-square 
indicates a higher level of dependence (or lack of independence) between the two 
attribute sets. The Chi-square formula (with Yates correction) follows; example 
chi-square calculations are included in Figure #7: 







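$$ \chi^2 = \sum_{i,j} \frac{\left( \left| O_{ij} - E_{ij} \right| - 0.5 \right)^2}{E_{ij}} $$

where $O_{ij}$ and $E_{ij}$ are the observed and expected values of cell (i, j). 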



chi-square(a) = (|11 - 1.93| - .5)^2 / 1.93 = 38.05, chi-square(tot) = 39.32 

Observed values, with expected values in parentheses: 

                                   Fatigue 
                          “yes”              “no”                 Total 
Depleted    “yes”         a: 11 (1.93)       b: 84 (93.07)        95 
Uranium     “no”          c: 146 (155.07)    d: 7505 (7495.93)    7651 
Exposure    Total         157                7589                 7746 

Figure 7. Chi-square Calculations 



The modified j-measure has been used by this research to date; however, a new statistical analysis 
package designed to analyze using chi-square is currently being constructed. A more 
straightforward formula for Chi-square will actually be used in the new statistical analysis package 
(Dixon and Massey, 1969, pp. 242 - 243): 

$$ \chi^2 = \frac{\left( \left| ad - bc \right| - \frac{N}{2} \right)^2 N}{(a + b)(a + c)(b + d)(c + d)} $$
III. SOLUTION CONCEPTS 



A. RESEARCH GOALS 

In the case of the Desert Storm research, years of conventional medical research have 
yielded no single syndrome or associated symptomatology set. This means that no fixed 
dependent variable set (combinations of diagnoses and/or reported standard symptoms) can be 
readily identified. The traditional epidemiological paradigm is to isolate a group of individuals 
with consistent symptoms/outcome diagnoses and then find what key demographic or exposure 
elements these individuals share. If relating demographic/exposure data are present, they are used to 
focus clinical research on an underlying cause. This approach has not proven fruitful to date, 
either because no syndrome exists or because the sheer volume of data in the CCEP database 
hides a relation of interest from human-controlled querying. Therefore, we have chosen to let 
technology simplify the problem from the outset of the knowledge discovery process. 

As mentioned before, there are four basic categories of useful data contained in the 
CCEP database (demographics, reported exposures, reported standard symptoms, and outcome 
diagnoses). While attributes in each category could prove useful as independent (LHS) or 
dependent (RHS) variables, it is doubtful that attributes from the same category will be useful as 
both LHS and RHS simultaneously. The research question is now simplified to an examination 
of which attributes (or combinations of attributes) in each category are most highly associated 
with (or statistically dependent on) which attributes from another major data category. 

EXAMPLE: What associative relationships exist between exposure attributes and 
outcome diagnosis attributes? Based on analysis, there is a high association between 
reported exposure to Scud Attack and Depleted Uranium and an outcome diagnosis of 
Post-traumatic Stress Disorder. [This is just an example, not an actual finding] 

This exponentially increases the size of the prospective search space, which is represented by 
$2^{\#LHS} \times 2^{\#RHS}$ (where #LHS = number of independent fields and #RHS = number of dependent 
fields and all attributes are Boolean; if not, the search space is even greater). The increase in 
search space can provide useful insight to medical researchers as they develop hypotheses. 
Instead of waiting for medical researchers to provide a more structured problem (and thereby 
reduce the search space), it was our feeling that an intelligent search technique could be 
employed effectively in the problem as given. Therefore, the role of our genetic algorithm is to 
test an extremely large subset of all fields in the CCEP database concurrently for levels of 
interest based on a specific model of epidemiological interest, to wit: 

$$ Q(LHS^{*}, RHS^{*}) = \max\bigl( Q(LHS', RHS') \bigr) $$

where $LHS' \subset LHS^{*}$ and $RHS' \subset RHS^{*}$ and $Q$ = fitness function 
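
To give a feel for how quickly this search space grows, the following Python sketch simply evaluates the expression above for a few field counts. The function name is hypothetical and the field counts are arbitrary illustrations, not CCEP figures. 

```python
def hypothesis_space_size(n_lhs: int, n_rhs: int) -> int:
    """Number of (LHS subset, RHS subset) pairs when every field is
    either included or excluded, i.e. 2**n_lhs * 2**n_rhs."""
    return 2 ** n_lhs * 2 ** n_rhs


if __name__ == "__main__":
    # Arbitrary field counts chosen only to show the growth rate.
    for n in (5, 10, 20, 40):
        print(n, n, hypothesis_space_size(n, n))
```

Each additional field on either side doubles the number of attribute combinations, before any non-Boolean values are considered. 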

We did count on CCEP medical researchers to define their concept of "interesting" and 
thereby guide our selection of an appropriate fitness function. This fundamental shift in 
knowledge discovery technique suggests that a genetic algorithm may be used to provide 
researchers with information to assist them in framing the initial research strategy, instead of 
framing the problem and then passing it to a genetic algorithm. We asked the following question, 
"If a syndrome does exist and the data necessary to identify it are contained in the CCEP 
database, what data relationships would it create in the CCEP database?" The answer to this was 
converted to a mathematical fitness measure. The resulting combinations of 
exposures/demographics and symptoms/diagnoses discovered will contain any identifiable 
syndromes, but the entire set of hypotheses will not all be guaranteed to be useful solutions. The 
goal is to present medical researchers with a more workable solution space in which to focus 
their conventional research efforts. This approach shifts the burden of searching a tremendous 
alternative space appropriately onto the genetic algorithm. 
B. SOLUTION STRATEGY 



Our solution strategy takes two forms, theoretical and practical. In the theoretical sense, 
the solution strategy rests on selection of the most efficient method of searching an extremely 
large solution space. There are three basic methods of search: 

• Random. In this type of search, a computer program will randomly generate 
hypotheses and pass these hypotheses to an evaluating routine. The evaluating 
routine assigns a fitness measure to each hypothesis based on the fitness function 
provided. If the hypotheses are generated sequentially, this method is also known as 
“brute force.” This method tests many hypotheses, because the hypothesis 
generation apparatus is extremely simple, but has no capacity to self-improve or tune 
the search to the operator’s goals. 

• Human-controlled Selective Search. In this case, a human formulates a hypothesis 
and translates it into the form of a query. The query is evaluated by the computer 
system and the results are returned to the human operator. It is assumed that the 
human operator draws upon practical knowledge of the problem and the results of 
prior queries to formulate new queries. Therefore, the quality of query formulation 
improves throughout the process. This allows the search to self-improve (including 
the human operator within the boundary of the search system) and obviously tune to 
the operator’s goals. However, the hypothesis generation is extremely slow. 

• Systematic, Intelligent, Automated Search. A computer program (genetic 
algorithm) generates hypotheses, passes them to an automated evaluator, receives 
results, and then re-generates a new set of hypotheses (systematically adapting its 
search based on its past performance as indicated in the results received). This 
technique demonstrates all three desirable search characteristics: fast hypothesis 
generation, self-improvement, and tuning to the operator’s goals. 
Figure #8 illustrates the comparative advantages of each search technique. It should now be 
clear, from a theoretical point of view, why a systematic, intelligent, automated search 
(a genetic algorithm) has been chosen. 




Figure 8. Characteristic of Different Search Techniques 



Now let us discuss the solution strategy on a more practical level. Assume for a moment that a 
genetic algorithm performs a systematic, intelligent search as theorized. The next section will 
provide a theoretical basis for this assumption. From Section II.D.4, we draw the premise that a 
syndrome will manifest itself as a high association between a specific combination of 
demographic and/or exposure attributes and a finite set of symptomatology or diagnoses. 
Combine this with the premise that either a modified j-measure or chi-square formula will indicate 
the level of association (or dependence) between two sets of attributes. Our strategy is then to 
instruct the genetic algorithm (DaMI) to find the most significant associations between 
demographics/exposures and symptoms and between demographics/exposures and diagnoses. 
These two analyses will divide the complete set of possible combinations of 
demographics/exposures into three categories (note that demographics/exposures are traditionally 
viewed as the independent attribute set): 






• Demographic/Exposure combinations which appear on neither analysis. Any 
hypothesis not contained on either study indicates that there is no statistical basis 
within the CCEP database to indicate that combination is a possible syndrome. This 
does not mean that it could not suggest a syndrome; as stated before, the CCEP 
database may not capture the appropriate data to identify the hypothesis as a 
syndrome. 

• Demographic/Exposure combinations are associated with both specific 
combinations of symptoms and specific combinations of diagnoses. This is the 
ideal case for suggesting the existence of a syndrome. It indicates that a group of 
PGW participants, sharing both a common symptomatology and outcome diagnosis 
set, belong to the demographic profile and/or report common exposure elements. 
Clinical research should be directed toward a prospective syndrome demonstrating 
the listed symptoms and diagnoses. Again this indicates that a hypothesis meets the 
mathematical definition of interesting, but the possibility of it being a syndrome can 
only be confirmed by evaluation by medical professionals. 

• Demographic/Exposure combinations are associated with either specific 
combinations of symptoms or diagnoses. A majority of hypotheses identified by 
DaMI will fall into this category. If only one correlation is made with the 
demographic/exposure data, there is a weaker indication that this particular 
combination signals a candidate syndrome. However, failure to appear on both 
analyses should not completely discount the hypothesis. As mentioned before, the 
failure of the CCEP database to capture all symptomatology or diagnoses may 
explain the appearance of the demographic/exposure combination on only one 
analysis. Therefore, hypotheses in this category should still be evaluated by medical 
professionals. 
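
The three categories above amount to simple set operations on the results of the two analyses. A minimal Python sketch follows, assuming each demographic/exposure combination is represented as a frozenset of field names; the function name and the field names are hypothetical and the data are not actual findings. 

```python
def categorize(all_combinations, symptom_hits, diagnosis_hits):
    """Split demographic/exposure combinations into the three categories.

    symptom_hits / diagnosis_hits: combinations that passed the interest
    filter in the symptom analysis and the diagnosis analysis respectively.
    """
    symptom_hits = set(symptom_hits)
    diagnosis_hits = set(diagnosis_hits)
    return {
        "neither": set(all_combinations) - symptom_hits - diagnosis_hits,
        "both": symptom_hits & diagnosis_hits,
        "one_only": symptom_hits ^ diagnosis_hits,
    }


if __name__ == "__main__":
    # Hypothetical field names, for illustration only.
    scud = frozenset({"scud_attack"})
    uranium = frozenset({"depleted_uranium"})
    combined = frozenset({"scud_attack", "depleted_uranium"})

    result = categorize(
        all_combinations=[scud, uranium, combined],
        symptom_hits=[scud, combined],
        diagnosis_hits=[combined],
    )
    for name, members in result.items():
        print(name, sorted(sorted(m) for m in members))
```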

Naturally, a certain degree of ambiguity exists concerning the specific fitness measurement 
thresholds with respect to interest (filtering). Filtering will be discussed in Chapter VI. But in a 
practical sense, this analysis will provide medical researchers with a prioritized list of interesting 
associations. The central point is that most possible hypotheses will prove statistically 
implausible and therefore fall into the first category, suggesting they not receive costly 
conventional medical research efforts. 

Finally, many initial DaMI discovery sessions were devoted to analyzing relationships 
between reported symptoms and outcome diagnoses. Early input from CCEP epidemiologists 
included a strong desire to identify unexpected symptom/diagnosis combinations. This study 
was appealing for initial research because all attributes involved were Boolean (as opposed to 
demographic and exposure attributes having more than two possible values). The research 
proved statistically successful (discussed in Chapter VI) but of limited practical value to CCEP. 
IV. DaMI GENETIC ALGORITHM ARCHITECTURE 



Up to this point, this thesis has focused on the theoretical structuring of the CCEP 
research problem and formulating the qualities of a genetic algorithm required to solve the 
problem. The second half of this thesis will focus on describing the tool developed to meet these 
challenges and the success of that tool in actual analysis. Based on the preceding discussion, the 
genetic algorithm must be specifically designed: 

• to accept an unstructured set of dependent and independent variables 

• to efficiently search an extremely large search space 

• to employ adaptive learning, where a priori information is used to guide future 
hypothesis testing 

This chapter will deal with DaMI from a macro systems perspective; Chapter V will address the 
details of the system’s design. 



A. PROGRAM MODULES 

Unlike many other genetic algorithms, the system designed for this research (DaMI) has 
been built using several independent modules. These modules consist of the genetic algorithm itself, a 
statistical package, a user interface, and a verification package. There were two primary reasons 
for this design strategy. The first was to relieve the genetic algorithm of the mundane analysis 
tasks, results filtering, and user interface tasks, thereby enhancing the space searching efficiency. 
The second reason was to aid in system development. By adopting a modular development 
approach, a great deal of effort can be focused on the core genetic algorithm technology and 
allow the system to begin rapid prototyping before optimal statistical analysis and user interface 
modules were developed. Once the core genetic algorithm is properly functioning, more robust 
statistical engines and user options may be added, using experience gained from test runs. A 
more in-depth explanation of the genetic algorithm (GA) operation is contained in the next 
chapter. Figure #9 shows the relationship between the DaMI modules. 
Figure 9. Relationship of DaMI Modules 



1. The Genetic Algorithm Package 

The genetic algorithm package is responsible for maintaining a list (population) of 
hypotheses (rules) in the current generation, selecting the most successful rules, and performing 
the genetic operations of reproduction, crossover, and mutation. These genetic operators allow 
the system to adapt the analysis to the goal model (fitness function) and improve the search 
hypotheses as each generation is processed. In this thesis, “hypothesis” and “rule” are used 
interchangeably; "hypothesis" is a medical research term and "rule" is an artificial intelligence 
term. Clearly, not all possible hypotheses will be tested (hence the advantage of the genetic 
algorithm), but the use of genetic operators ensures that the rules being tested have the highest 
probability of satisfying the given fitness function (Holland, 1975). In the DaMI system, the 
genetic algorithm stores hypotheses as combinations of attributes only, not as combinations of 
attributes and specific values. Competition is based on success of attribute sets as a whole. 
Attribute sets (like gender, receiving the botulism vaccine, exposure to uranium [independent 
variables] and Depression and Chronic Fatigue Syndrome [dependent variables]) are passed to 
the statistical package, which returns an aggregate fitness value for all possible value 
combinations of those attributes. The statistical package is called recursively during the 
processing of a single generation for every rule, until the entire generation is evaluated. Then the 
genetic algorithm produces the next generation and the process is repeated. 



2. The Statistical Analysis Package 

The statistical analysis package receives a set of independent and dependent attributes to 
evaluate from the genetic algorithm package. The statistical package requires no information 
other than a list of field names to evaluate. The number of attributes in each request sent to the 
statistical package varies, so it must be capable of processing loosely bounded problems. 

During pre-processing, the analysis database (database under analysis; in this case the CCEP 
Persian Gulf War Database) is examined and a table is created of all attributes and their possible 
values. This table is used as the source for generating each individual query (there are many 
individual queries generated to answer each request from the genetic algorithm) and ensuring that 
each possible combination is tested but only once. The statistical package then computes the 
fitness of each possible attribute/value combination. An aggregate fitness measure is then 
computed and returned to the genetic algorithm package. As the statistical package tests 
attributes against the database under analysis, it also performs a test of each attribute/value 
combination against a second database. This second test is not returned to the genetic algorithm 
and therefore does not affect hypothesis competition. This value is stored to be used later for 
results validation (see section V.C). 
3. User Interface 



The user interface controls interaction between DaMI and the system operator. The user 
interface allows the user to adjust tunable parameters (discussed in Chapter V), view the 
discovery database at various stages of processing, and start and reset the genetic algorithm 
package. The user interface also provides intermediate feedback to the user during DaMI 
operation. It was designed using the Foxpro Screen Design Wizard and is controlled by push 
buttons and pop-up menus. Settings may not be adjusted “on-the-fly” when the genetic 
algorithm is operating. An example of the user-interface screen is shown in Figure #10 below. 
The user-interface module is disposable, and therefore an in-depth discussion of the 
user-interface design is not included in this thesis. 
B. REPORTING AND FILTERING 



Once a discovery session has been completed by DaMI, several files are created. A 
transcript of each hypothesis individual (at the attribute level) of every generation is created as 
DaMI operates, along with a transaction record of each genetic operation employed, the source 
(parent) rules, and resulting offspring. The transaction record also maintains a time stamp at the 
start of each generation which can be used to monitor processing speed. DaMI also records how 
many actual combinations were tried during the session. These files will not be discussed in 
detail (file structures are contained in Appendix B). 

The most important file created (rulelib.dbf) contains a list of every hypothesis tested and 
used to determine an aggregate fitness measure (without duplication). Several key points must 
be cleared up at this juncture. First, not every possible attribute/value combination is used to 
compute the aggregate fitness value of a given attribute set (this is a tunable parameter). Second, 
Rulelib.dbf stores attribute and value combinations (as opposed to the session transcript which 
records only the higher-level attribute sets). It also contains the intermediate, final, and 
verification fitness measures. This makes rulelib.dbf the actual answer produced by DaMI. 

Figure #11 is an excerpt from rulelib.dbf. 
Figure 11. Rulelib.dbf Display 



Finally, whatever fitness measure is used will probably not have an arbitrary threshold of 
“interest.” A fitness measure is only useful in ranking the relative interest of hypotheses tested; 
therefore some form of filtering will be done prior to reporting. However, it is inadvisable to 
enforce that filter during operation. Instead, rulelib.dbf is left in the most robust 
(non-summarized) form practical; filtering is performed arbitrarily using an SQL-type query language on 
a case-by-case basis for each report. 

Several reports have been developed in Foxpro for the DaMI system. However, as with 
filtering, reports are tailored to suit the needs of each individual recipient. Summary reports are 
created on an ad-hoc basis; there is a standard detailed report which contains hypotheses and all 
intermediate and final statistical computations. The detailed reports (two main studies were 
conducted in this thesis) of the top 100 hypotheses discovered are contained in Appendix C. 
C. SYSTEM REQUIREMENTS 



1. Hardware and Software Requirements 

From the outset, the author’s goal was to construct a research tool and methodology that 
can be employed by researchers in their community, without the need for a laboratory of (scarce) 
high-power computer assets. In any case, it has already been shown that raw processing power is 
quickly overcome by large unstructured database analysis requirements. Therefore, a genetic 
algorithm is used to intelligently enhance the processing capabilities of whatever platform it runs 
on. In keeping with this goal, DaMI was designed to operate on a standard personal computer 
using inexpensive commercial software. The hardware and software requirements to run 
DaMI are listed below: 

Hardware Requirements 

Personal Computer, 80486/66 MHz processor or better 

8 Megabytes of RAM 

200 Megabytes of free hard disk storage 

Software Requirements 

Microsoft® Visual Foxpro version 3.0 

Microsoft® Windows version 3.xx or Windows 95 

Surpassing the minimum hardware requirements will of course benefit system performance. The 
most dramatic performance improvements will be realized by increasing RAM and the access 
speed of the PC hard drive. 
2. Processing Limits 



DaMI is primarily limited by the time available to the user to complete the analysis; 
however, there are some processing limitations. For the preservation of system speed, DaMI 
maintains the active population in a RAM-based array. Therefore, it is limited by the maximum 
array size allowed in Foxpro. The required array size is a function of population size per 
generation and number of attributes under analysis. The formula for this metric is: 

population_size x analysis_fields < 73,500 

Under this limitation, analysis of 70 fields with a population size of 15,000 (array size 1,050,000) 
would exceed the system limits. Only the number of fields actually under analysis is used in this 
calculation, not the number of fields in the database being analyzed. Also, the number of records 
in the analysis database is limited only by the maximum Foxpro table size (Maximum records 
per table file = 1 billion. Maximum size of a table file = 2 gigabytes. Maximum fields per record 
= 255). Naturally, larger files will take longer for the statistical package to analyze. 
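
The array-size constraint can be checked before a session is configured. The following Python sketch is illustrative only: the function names are hypothetical, and the limit is simply the figure quoted in the formula above. 

```python
MAX_ARRAY_CELLS = 73_500  # limit quoted in the formula above


def fits_in_array(population_size, analysis_fields, limit=MAX_ARRAY_CELLS):
    """True when population_size x analysis_fields stays under the limit."""
    return population_size * analysis_fields < limit


def max_population(analysis_fields, limit=MAX_ARRAY_CELLS):
    """Largest population size that satisfies the strict inequality."""
    return (limit - 1) // analysis_fields


if __name__ == "__main__":
    print(fits_in_array(15_000, 70))  # the 70-field example from the text
    print(max_population(70))
```

Under these assumptions, a 70-field analysis would be restricted to a population of roughly one thousand rules per generation. 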
V. SEARCHING THE HYPOTHESIS SPACE: DaMI IMPLEMENTATION 

A. THE GENETIC ALGORITHM 



The basic architecture of the DaMI Genetic Algorithm is based on (Goldberg, 1986), 
with the notable exception that our genetic algorithm stores rules as strings of Boolean attributes 
("true" = consider the attribute; "false" = don't consider the attribute). This allows the genetic 
algorithm to process simple binary strings, as opposed to strings of field values and wildcards 
(Goldberg uses a wildcard symbol to denote that any value of this attribute is acceptable). This does not imply that 
the genetic algorithm is simplistic; in fact, competition of attributes in aggregate actually provides 
for a more efficient search of the alternative space. As can be seen in Figure #12, a conventional 
genetic algorithm will operate on hypotheses as combinations of attributes and values. In our case, 
this prevents the genetic algorithm from considering the associations between risk factors 
(exposures/demographics) and outcomes (symptoms/diagnoses) in aggregate. By using the 
DaMI methodology, risk factors and outcome associations (hypotheses) are examined 
comprehensively before competing for selection and genetic recombination. 
Conventional Genetic Algorithm Representation (Goldberg, 1989) 
DaMI Genetic Algorithm Representation 
Figure 12. Conventional and DaMI Algorithm Representations 
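
As a rough illustration of the Boolean-string representation, the Python sketch below encodes a rule as a tuple of flags over a fixed list of fields. The field names are hypothetical, the single mask covering both independent and dependent fields is a simplification, and single-point crossover and bit-flip mutation are assumed here as the textbook forms of those operators rather than taken from the DaMI source. 

```python
# Hypothetical field names; True means "consider this attribute".
ATTRIBUTES = ("gender", "botulism_vaccine", "uranium_exposure",
              "depression", "chronic_fatigue")


def decode(mask):
    """List the attribute names switched on in a Boolean rule string."""
    return [name for name, bit in zip(ATTRIBUTES, mask) if bit]


def crossover(parent_a, parent_b, point):
    """Single-point crossover: swap the tails of two rules at `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(mask, index):
    """Flip one position of the rule string."""
    flipped = list(mask)
    flipped[index] = not flipped[index]
    return tuple(flipped)


if __name__ == "__main__":
    rule_a = (True, False, True, True, False)
    rule_b = (False, True, False, False, True)
    print(decode(rule_a))
    print(crossover(rule_a, rule_b, 2))
    print(decode(mutate(rule_a, 1)))
```

Because a rule names attributes only, the values of those attributes are enumerated later by the statistical package rather than carried in the string. 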



This genetic algorithm uses a "roulette wheel" (Goldberg, 1989) model for competitive 
selection with the size of each rule's "slice" (or probability of selection) being directly 
proportional to the fitness measure (determined by the statistical package) of each rule. Slices are 
selected for reproduction, crossover, and mutation randomly, but the "size" of each slice gives a 
proportionally higher chance of survival to rules with higher fitness. As individual rules show 
reproductive dominance, these individuals may possess more than one slice on the roulette 
wheel (i.e. a particularly strong rule may reproduce more than once per generation, giving it 
more than one slice on the subsequent generation's roulette wheel). We chose the roulette wheel 
(Goldberg, 1989) because it allows the stronger rules to dominate more quickly than with other 
methods (e.g. rank or tournament) and thereby converge faster. The basic genetic operators 
(reproduction, crossover, and mutation) are all implemented in DaMI, with operator adjustable 
profiles (see section V.D). 
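
Roulette wheel selection as described above can be sketched in a few lines of Python. This is a generic fitness-proportional selector, not the DaMI routine; the function name is hypothetical and non-negative fitness values with a positive total are assumed. 

```python
import random


def roulette_select(rules, fitnesses, rng):
    """Pick one rule with probability proportional to its fitness.

    Assumes fitnesses are non-negative and sum to a positive value.
    """
    spin = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for rule, fitness in zip(rules, fitnesses):
        running += fitness  # right-hand edge of this rule's slice
        if spin < running:
            return rule
    return rules[-1]  # guards against floating-point round-off


if __name__ == "__main__":
    rng = random.Random(0)
    picks = [roulette_select(["weak", "strong"], [1.0, 9.0], rng)
             for _ in range(2000)]
    print(picks.count("weak"), picks.count("strong"))
```

A rule that is selected several times in one generation contributes several copies (several slices) to the next generation's wheel, which is the dominance effect noted above. 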
B. THE STATISTICAL ANALYSIS ALGORITHM 



The DaMI statistical package in use is a fairly simple algorithm. The modular design of 
our system allows for the replacement of this statistical package with a more robust commercial 
package in the future. At this point, the cost of designing an interface outweighs potential 
benefits; this may not be true for more complex analysis projects. 

Given a set of dependent attributes (RHS) and independent attributes (LHS), the 
statistical package creates a two-dimensional array of attributes and possible values. The array 
also contains the number of possible values for each attribute and a counter for each attribute. As 
the statistical algorithm processes each combination, the counter for each attribute is incremented 
accordingly, using a counting base for each attribute corresponding to that attribute's number of 
possible values (i.e. if the attribute "gender" had two possible values then its counter 
would increment in base 2; if the attribute "state" had fifty values then its counter would 
increment in base 50). The algorithm uses each individual attribute’s current counter value to 
reference a cell in the array. The cell values and attribute names are used to create a textual query 
statement. The query statement is then applied to the analysis database and the fitness measure is 
applied to the result. This allows the same statistical algorithm to loop recursively with a 
minimum amount of software code, regardless of the number of attributes passed to it by the 
genetic algorithm. 
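
The per-attribute counters described above behave like an odometer whose digits each have their own base. A minimal Python sketch follows; the function names, field names, and the query text format are hypothetical (the actual Foxpro query syntax used by DaMI is not shown in this thesis), and every attribute is assumed to have at least one possible value. 

```python
def value_combinations(attribute_values):
    """Yield every attribute/value combination exactly once.

    attribute_values maps an attribute name to its list of possible values.
    One counter is kept per attribute; each counts in the base given by
    that attribute's number of values.
    """
    names = list(attribute_values)
    counters = [0] * len(names)
    while True:
        yield {n: attribute_values[n][c] for n, c in zip(names, counters)}
        # Increment the right-most counter, carrying leftwards on overflow.
        pos = len(names) - 1
        while pos >= 0:
            counters[pos] += 1
            if counters[pos] < len(attribute_values[names[pos]]):
                break
            counters[pos] = 0
            pos -= 1
        if pos < 0:  # every counter rolled over: enumeration is complete
            return


def to_query(combo):
    """Build an illustrative textual condition from one combination."""
    return " AND ".join(f"{name} = '{value}'" for name, value in combo.items())


if __name__ == "__main__":
    fields = {"gender": ["M", "F"], "exposed": ["yes", "no", "unknown"]}
    for combo in value_combinations(fields):
        print(to_query(combo))
```

The same loop handles any number of attributes, which is the property the paragraph above relies on. 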

Several fitness measures have been used (see the discussion in section II.E.4). Our goal, 
since medical researchers seek associations between patient risk factors/exposures, reported 
symptoms, and resulting diagnoses, is to award the highest fitness values to those LHSs and 
RHSs which are most highly interdependent (vice independent). Since each request from the 
genetic algorithm generates many individual statistical package queries, some means of 
aggregating the fitness measures of all possible combinations is required. Several different 
methods for determining the aggregate fitness measure were considered. Obviously, an average 
of all fitness measures for a given attribute set is non-competitive. In many cases, the highest 
individual fitness measure has been used because of the specificity of the research question. In 
other cases, an aggregate measure may be taken using Chi-square or an average of the top three 
or four j-measures (use of an aggregate value limits the awarding of a high fitness measure based 
on a single unexpected outlier in the research database). 
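
The "highest individual" and "average of the top few" aggregation choices can be expressed with one small function. This Python sketch is illustrative; the function name and the sample scores are hypothetical. 

```python
def aggregate_fitness(scores, top_n=1):
    """Average the top_n individual fitness values of one attribute set.

    top_n=1 reproduces the "highest individual fitness" choice; a larger
    top_n damps the effect of a single outlier.
    """
    if not scores:
        raise ValueError("at least one fitness value is required")
    best = sorted(scores, reverse=True)[:top_n]
    return sum(best) / len(best)


if __name__ == "__main__":
    scores = [1.0, 2.0, 9.0, 4.0]  # made-up j-measures for one attribute set
    print(aggregate_fitness(scores, top_n=1))
    print(aggregate_fitness(scores, top_n=3))
    print(sum(scores) / len(scores))  # plain average, shown for contrast
```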

A rule cacher (like a disk cacher, except for hypotheses) is used to prevent duplicate 
evaluation of any rule throughout the discovery session. A table of rules evaluated by the 
statistical package and resulting fitness values is maintained. Before sending a rule to the 
statistical package, the genetic algorithm checks the table of rules already evaluated. If the rule 
has been previously evaluated, the genetic algorithm uses the fitness value from the cache table. 
If not, the genetic algorithm package sends the rule to the statistical package and establishes a 
new entry (with resulting fitness) in the cache table. 
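
A minimal sketch of such a rule cache follows. The class and function names are hypothetical, and treating a rule as an unordered pair of attribute sets (so that reordered attributes hit the same cache entry) is an assumption rather than something the text states.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str]]  # (LHS attributes, RHS attributes)


class RuleCache:
    """Table of already-evaluated rules and their fitness values."""

    def __init__(self, evaluate: Callable[[Rule], float]) -> None:
        self._evaluate = evaluate          # stands in for the statistical package
        self._table: Dict[Rule, float] = {}
        self.evaluations = 0               # number of calls actually sent out

    def fitness(self, rule: Rule) -> float:
        if rule not in self._table:        # cache miss: evaluate and record
            self._table[rule] = self._evaluate(rule)
            self.evaluations += 1
        return self._table[rule]           # cache hit: reuse stored fitness


def make_rule(lhs: Iterable[str], rhs: Iterable[str]) -> Rule:
    return (frozenset(lhs), frozenset(rhs))


# Stand-in scorer for illustration only; not a real fitness measure.
cache = RuleCache(lambda rule: float(len(rule[0]) + len(rule[1])))
first = cache.fitness(make_rule(["gender", "service"], ["ptsd"]))
second = cache.fitness(make_rule(["service", "gender"], ["ptsd"]))  # same rule, reordered
```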



C. TUNABLE PARAMETERS 

The program has several tunable parameters to adjust genetic algorithm operation. 
Tunable parameters are set via the user interface at the commencement of each discovery session. 

• Crossover probability, probability that a selected rule will exchange information with 
another selected rule 

• Mutation probability, probability that a selected rule will undergo a random mutation 

prob(reproduction) = 100% - (prob(crossover) + prob(mutation)) 

• Population size, number of individual rules in each generation 

• Number of generations, number of generations to simulate 

• Maximum rule complexity, maximum number of dependent and independent attributes 
allowed in each hybrid rule (set individually for dependent and independent) 

• Average complexity of initial rule set, average number of dependent and independent 
attributes allowed in each rule of randomly generated initial population 

• Top rules to aggregate, number of rules (in order of decreasing fitness) to use in 
computing aggregate fitness by the statistical package 
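
As an illustration of how the three operator probabilities relate, the sketch below stores the parameters in a small container and picks an operator with a single random draw. The class and field names are hypothetical; the default values are the production-run settings reported in the results chapter.

```python
import random
from dataclasses import dataclass


@dataclass
class GAParameters:
    crossover_prob: float = 0.30   # production-run setting from the results chapter
    mutation_prob: float = 0.03
    population_size: int = 1000
    generations: int = 130

    @property
    def reproduction_prob(self) -> float:
        # prob(reproduction) = 100% - (prob(crossover) + prob(mutation))
        return 1.0 - (self.crossover_prob + self.mutation_prob)

    def choose_operator(self, rng: random.Random) -> str:
        """Pick the operator for one selected rule from a single uniform draw."""
        draw = rng.random()
        if draw < self.crossover_prob:
            return "crossover"
        if draw < self.crossover_prob + self.mutation_prob:
            return "mutation"
        return "reproduction"


params = GAParameters()
rng = random.Random(0)
counts = {"crossover": 0, "mutation": 0, "reproduction": 0}
for _ in range(10_000):
    counts[params.choose_operator(rng)] += 1
```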

D. PROBLEMS AND IMPROVEMENTS 



Before this discussion of DaMI implementation is concluded, we would like to discuss 
some of the problems encountered in our implementation and our solutions to these problems. 
We found, as many other researchers have, that genetic algorithms are quite successful at 
adaptively improving the quality of tested rules to suit the provided fitness function. However, 
the greatest challenge has been to ensure that our search model adequately represented the 
research questions (i.e. the genetic algorithm is doing what it was told to do, but have we 
provided it with accurate instructions?). Our focus on problems with proper tuning of the genetic 
algorithm should in no way degrade the perception that a genetic algorithm is an extremely fast 
and effective search technique. It does work as advertised! 



1. Convergence Issues 

One challenge faced by our research was to ensure that the algorithm would effectively 
(not necessarily physically) test the entire search space. A genetic algorithm will rapidly 
(especially using roulette wheel competition) improve the average fitness measure of rules within 
successive generations, but in many cases, the speed of improvement degraded the algorithm's 
ability to comprehensively examine the search space. 

It should be recalled from genetic search theory (Holland, 1975) that search regret (or 
missed rules of interest) is minimized if attributes of successful rules are tested in exponentially 
more combinations in successive generations, and attributes of unsuccessful rules are tested 
exponentially fewer times. This is implemented in a genetic algorithm by giving successful rules 
a higher chance of selection (and thereby the chance to mix information with other successful 
rules) based on the level of their fitness measure. Naturally, successful rules begin to dominate 
the population (in our case take up more slots on the roulette wheel) and increase the chance that 
their constituent attributes are used for future rules. A problem arises when the fitness measure of 
a mediocre rule is disproportionately larger than the other individuals of its generation. If this 
mediocre rule dominates the population too quickly then its attributes provide the only material 
for future rules. The resulting phenomenon is called premature convergence (Koza, 1988) and 
will prevent comprehensive search of the entire space. 

Several steps were taken to prevent this, but generally speaking, great care must be used 
in selecting a fitness measure. If the slope of fitness in proportion to rule quality is too great, 
premature convergence is likely. The author chose to apply a natural logarithm scale to the 
fitness measure. This gave a strong relative advantage to good rules over weak rules, but slowed 
the domination of good rules (or local maxima) over their slightly weaker peers. The author 
also developed a technique called same-parent crossover randomization. Ordinarily, if 
two identical parents are selected for crossover, the resulting "offspring" are duplicates of the 
parents. In our crossover operator, if the two parents are the same, a single parent is randomly 
bisected into two offspring. Each offspring receives a portion of the parent's genetic material 
(attributes) and a portion of randomly generated material. This has no effect on the algorithm at 
early stages, but it strongly increases the effective mutation probability as the population becomes 
dominated by a few rules (which causes the crossover operator to lose its ability to effectively 
generate new hypotheses; see Figure #13). 
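
Same-parent crossover randomization might look roughly like the sketch below. It is an illustration under stated assumptions: a rule is modeled as a flat list of attribute names, the ordinary crossover for distinct parents is a simple one-point exchange, and the random material is drawn from attributes not already in the parent. None of these details are specified in the text.

```python
import random
from typing import List, Tuple


def crossover(p1: List[str], p2: List[str], pool: List[str],
              rng: random.Random) -> Tuple[List[str], List[str]]:
    """One-point crossover with same-parent randomization (hypothetical sketch)."""
    if sorted(p1) == sorted(p2):
        # Identical parents: bisect one parent at a random point and
        # complete each half with randomly generated material.
        cut = rng.randint(1, len(p1) - 1) if len(p1) > 1 else 0
        fresh = [a for a in pool if a not in p1]  # attributes new to this rule
        children = []
        for half in (p1[:cut], p1[cut:]):
            extra = rng.sample(fresh, min(len(p1) - len(half), len(fresh)))
            children.append(half + extra)
        return children[0], children[1]
    # Distinct parents: ordinary one-point exchange of attribute lists.
    cut = rng.randint(1, max(1, min(len(p1), len(p2)) - 1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]


rng = random.Random(1)
pool = [f"exposure_{i}" for i in range(29)]          # hypothetical attribute names
parent = ["exposure_1", "exposure_2", "exposure_3", "exposure_4"]
child_a, child_b = crossover(parent, list(parent), pool, rng)
child_c, child_d = crossover(["exposure_1", "exposure_2"],
                             ["exposure_5", "exposure_6"], pool, rng)
```

Early in a run, identical parents are rarely drawn, so the randomized branch is seldom taken; once a few rules dominate, it fires often and behaves like extra mutation.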




Figure 13. Effect of Same-parent Crossover Randomization 

Finally, it was noted that since a genetic algorithm is based on probabilistic selection, 
some extremely strong rules failed to survive (by sheer chance) despite their selective 
advantage. This is an understandable consequence of natural selection; sometimes more capable 
species die solely because of “bad luck.” The author reserved several spaces on the roulette 
wheel for the rules with the highest fitness measure in the population, regardless of their 
selection by the algorithm. This ensures that an extremely "good" rule will continue to be 
available for selection and recombination in successive generations. 
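
The combination of logarithmic scaling and reserved roulette-wheel slots can be sketched as follows. This is a hedged illustration: `math.log1p` (the natural log of 1 + fitness) is used so that weights stay non-negative, which is an assumption about how the scale was applied, and the function name and sample fitness values are hypothetical. The sketch assumes at least one rule has positive fitness.

```python
import math
import random
from typing import List, Sequence


def select_parents(fitnesses: Sequence[float], n: int, elite: int,
                   rng: random.Random) -> List[int]:
    """Return n rule indices: reserved slots for the best rules, roulette for the rest."""
    ranked = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i], reverse=True)
    chosen = ranked[:elite]                            # reserved slots
    weights = [math.log1p(f) for f in fitnesses]       # log-scaled wheel
    chosen += rng.choices(range(len(fitnesses)), weights=weights, k=n - elite)
    return chosen


fitnesses = [1.0, 1.2, 0.8, 40.0]  # hypothetical: one rule far ahead of its peers
raw_share = fitnesses[3] / sum(fitnesses)
log_share = math.log1p(fitnesses[3]) / sum(math.log1p(f) for f in fitnesses)
parents = select_parents(fitnesses, n=10, elite=1, rng=random.Random(0))
```

On these made-up numbers the dominant rule would hold over 90% of an unscaled wheel but under 70% of the log-scaled one, while the reserved slot still guarantees it is available for recombination.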



2. Processing Speed Issues 

However sophisticated the search technique may be, we must still keep the magnitude of 
this search problem in mind. One of our research goals was to ensure that the technology created 
did not require sophisticated, expensive, or proprietary hardware or software. For this reason the 
DaMI application was developed to run on an 80486/66 MHz personal computer using the 
Microsoft Windows 3.xx or Windows 95 operating system. (Pentium 166's are used for 
production runs.) A very simple problem such as analyzing relations between 15 standard 
symptoms and 21 diagnoses (Boolean fields) yields a search space of 69 billion combinations. A 
486 computer, using the "brute force" method, can test about 600,000 hypotheses (rules) per day. 
At that rate, this problem would take more than 315 years to complete. Even if the speed of 
processing could be accelerated by a factor of 100, the problem would still be impractically large. 
We have processed runs involving exposures/demographics and diagnoses that were on the order 
of 9.457 x 10^16 combinations. Actual processing benchmarks are included later in the paper, but the point for 
the moment is that results using genetic algorithms take days not minutes to achieve. 
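
The brute-force estimate can be checked with a few lines of arithmetic. The variable names are illustrative; the inputs (15 symptoms, 21 diagnoses, 600,000 tests per day) come from the text. The exact count of 2^36 gives roughly 314 years, and the rounded figure of 69 billion gives about 315.

```python
symptoms, diagnoses = 15, 21                 # Boolean fields, from the text
combinations = 2 ** (symptoms + diagnoses)   # each field is present or absent
tests_per_day = 600_000                      # observed brute-force rate on a 486

years = combinations / tests_per_day / 365
years_if_100x_faster = years / 100

print(f"{combinations:,} combinations")
print(f"about {years:.0f} years at {tests_per_day:,} tests per day")
print(f"about {years_if_100x_faster:.1f} years even with a 100-fold speedup")
```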

Naturally the author took several steps to enhance speed on the given PC architecture. 
First, the population of rules is maintained in a RAM-based array space as is the statistical 
package's attribute and possible value matrix. This allows the genetic operations to be carried out 
with extreme speed. Task complexity is not really a speed issue at all for the genetic algorithm 
package; unfortunately, the database under analysis cannot be placed in RAM, so the statistical 
package becomes the speed limiting operation. Genetic operations take several seconds per 
population, but the statistical package may take hours to analyze a single, large population. In the 
case of the statistical package, the number of attributes and possible values is much more significant 
than the number of records in the analysis database. If the operating architecture could be 
enhanced to allow the genetic algorithm to pass statistical requests to multiple personal computer 
nodes, a significant processing advantage could be attained. 

The nature of our research question concerning a possible syndrome affecting Persian 
Gulf War participants limits the complexity requirement of rules generated. In other words, rules 
involving too many attributes may be statistically significant, but are so specific that they may 
only describe a single participant. Naturally, these rules may have a selective advantage over less 
specific rules, because a single outlier reporting a highly unusual combination of attributes will 
be very highly rated. However, rules involving a single individual do not suggest a syndrome, 
which by definition is a series of conditions affecting a group of individuals. Therefore, we 
included a tunable parameter which limits the maximum complexity of rules generated. Rules 
involving too many attributes are given a low fitness measure and are not sent to the statistical 
analysis package. It should be obvious that increasing the number of attributes in a single rule 
exponentially increases the complexity of the analysis by the search package. 



3. Tuning the Fitness Measure, Verification, and Validation 

One of the greatest challenges faced is to develop a fitness measure that accurately reflects the 
requirements of CCEP medical researchers. It is critical that feedback is obtained at every step of 
the discovery process. 

EXAMPLE Just because there is a high association between hair loss and chronic 
fatigue syndrome within the database under examination does not mean that this is of 
any medical significance. 

It must also be understood that our technique has drastically reduced the number of 
correlations to be investigated by medical researchers, but it does not guarantee that each rule is 
of value. That knowledge can only be obtained from medical professionals. Our goal is to 
provide a catalyst for their research and a "jumping off point" for more in-depth clinical 
investigation. If that mindset is maintained, the genetic algorithm is proving most helpful. 

Verification is also a key issue. Rules and their associated fitness measures generated by 
a genetic algorithm will be true of the database analyzed. That has been easily verified by conventional query. Ensuring 
that the rules generated are the best ones to describe the analysis database is more challenging. 
We have two different methods for responding to this challenge: duplicability and 
reproducibility. 

The database of 19,000 records has been split into several sample sets. Each sample set is 
selected randomly without replacement. We actually use two database subsets of around 7,700 
records each. The genetic algorithm is applied to one sample subset and its output rules are then 
applied to the second subset. If the fitness measure for a rule is uniform throughout the two 
independent, randomly-selected databases, then there is confidence that this rule holds for the 
entire database and is not a statistical anomaly. We call this attribute duplicability. 
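
Drawing the two subsets can be sketched as below. The function name and the use of integer record identifiers are assumptions for illustration; the record count and subset size are the approximate figures given in the text.

```python
import random
from typing import List, Sequence, Tuple


def split_samples(record_ids: Sequence[int], sample_size: int,
                  rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw two disjoint random samples of equal size, without replacement."""
    if 2 * sample_size > len(record_ids):
        raise ValueError("not enough records for two disjoint samples")
    drawn = rng.sample(list(record_ids), 2 * sample_size)  # no record drawn twice
    return drawn[:sample_size], drawn[sample_size:]


# Roughly 19,000 records split into two subsets of about 7,700 each.
discovery, verification = split_samples(range(19_000), 7_700, random.Random(42))
```

The genetic algorithm would then be run against the first subset only, with the second held back for checking its output rules.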

The second verification procedure is reproducibility. It cannot be proven that a genetic 
algorithm has actually found the best rules for a given search space. The only way to accomplish 
this is to actually check every possible combination, which we have already stated is physically 
impractical. How then may we have any certainty that the technique has worked; that the 
algorithm has used a sufficiently large population over a sufficiently large number of generations 
to achieve an acceptable answer? Since a genetic algorithm depends on the simulation of survival 
of the fittest (Darwinism) based solely on probability modeling and random number generation, 
it will never analyze the same problem the same way twice. We run every problem twice and 
note the number of rules that occur in both outcome rule sets. If both independent discovery 
sessions produce a high number of rule intersections, then this indicates that the state space 
has been searched exhaustively (see Figures #14 and #15). If this is not the case, then the 
population size and/or number of generations must be increased for an effective discovery 
session. 
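
One way to quantify the rule intersection between independent runs is sketched here. Defining the rate as rules found by every run divided by rules found by any run, within a fitness band, is an assumption; the text does not give the exact formula. Rule identifiers and fitness values are made up.

```python
from typing import Dict, Sequence


def intersection_rate(runs: Sequence[Dict[str, float]], low: float, high: float) -> float:
    """Share of rules with fitness in [low, high) that every run discovered."""
    bands = [{rule for rule, fit in run.items() if low <= fit < high} for run in runs]
    found_by_any = set().union(*bands)
    if not found_by_any:
        return 0.0
    found_by_all = set.intersection(*bands)
    return len(found_by_all) / len(found_by_any)


# Hypothetical outcome rule sets: rule id -> fitness measure.
run_a = {"r1": 9.1, "r2": 8.4, "r3": 2.0, "r4": 1.5}
run_b = {"r1": 9.1, "r2": 8.4, "r3": 2.0, "r5": 2.2}
run_c = {"r1": 9.1, "r2": 8.4, "r3": 2.0, "r6": 1.1}
runs = [run_a, run_b, run_c]

high_band = intersection_rate(runs, 8.0, float("inf"))
low_band = intersection_rate(runs, 1.0, 3.0)
```

A high rate among the fittest rules combined with a low rate among weak rules is the pattern the text associates with an effective search.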

[Diagram of hypotheses discovered by run #1, run #2, and run #3; larger x's indicate larger 
fitness measures. A large number of the highest fitness rules are discovered by all three runs. 
This suggests a comprehensive search of the alternative space.] 

Figure 14. Strong Reproducibility in GA Search 



[Diagram of hypotheses discovered by run #1, run #2, and run #3; larger x's indicate larger 
fitness measures. Little or no intersection between hypotheses discovered by independent runs. 
Suggests search space has not been effectively searched.] 

Figure 15. Weak Reproducibility of GA Search 



Finally, a great deal of emphasis is placed on the discovery of rules which are intuitively 
obvious to medical professionals. This may appear insignificant at first, but as mentioned before, 
genetic algorithms are unguided random processes possessing no knowledge of medical facts. 
If, through their learning process, they produce a series of rules that mimic accepted medical 
knowledge, then this lends confidence that accompanying rules, which do not make intuitive 
sense, may contain new and significant information. 

VI. RESULTS 



A. SUMMARY 

DaMI has achieved striking successes throughout our experiments. The theoretical basis 
for the design of this search algorithm is sound and has allowed this system to perform and 
produce results. DaMI is a very exciting application because its performance matches or exceeds 
theoretical expectations, and it identifies previously undiscovered correlations in the CCEP 
Desert Storm Database. In this chapter, we will characterize the initial success of DaMI by 
presenting a series of experimental results which build on the framework developed by this 
thesis. Success in this research is measured by responding to the following questions: 

• Did the Genetic Algorithm (DaMI) perform as theoretically predicted? 

• What correlations did the Genetic Algorithm actually find in the CCEP database, and 
were these hypotheses, at least from a statistical perspective, consistent with the 
research goals? 

• How useful were the hypotheses discovered to CCEP medical researchers? 

Each will be examined individually in the following sections of this chapter, building up to a 
comprehensive evaluation of DaMI’s theoretical as well as practical performance. 

Twenty-five discovery sessions (runs) have been conducted by DaMI thus far, of which 
six production runs are discussed in the results section. Earlier runs were used to test the 
performance of DaMI during development and refine the settings of tunable parameters for 
optimal discovery. Genetic algorithm development is a constant process of discovery, feedback 
and refinement. The runs conducted to date are by no means all-inclusive, but rather chronicle a 
successful venture into the CCEP database. 

DaMI has been directed to analyze two different perspectives of the CCEP database 
(three identical production runs for each perspective). The first runs search for associations 
between the gender, service, race, and reported exposures of PGW participants (LHS) and the 
diagnoses that were assigned by the CCEP medical examination process (RHS). We refer to 
these runs as exposure-to-diagnosis runs. The second set of runs search for associations between 
gender, service, race, and reported exposures of PGW participants (LHS) and the standard 
symptoms that were elicited during the CCEP medical examinations (RHS). We refer to these 
runs as exposure-to-symptom runs. The reader is referred to Appendix A for a detailed list of 
fields included in each analysis. Each production run utilized a population size of 1000, a 
crossover probability of 30%, and a mutation probability of 3.0% (see section V.C for a discussion of 
tunable parameters). The modified j-measure has been used as the fitness measure, and only the single 
best j-measure of all combinations of each individual attribute set was used for aggregate fitness 
by the statistical analysis package (see section V.B). Hypotheses generated were limited to 
combinations of up to three LHS attributes and two RHS attributes. Production runs have 
simulated at least 130 generations; some were allowed to continue for 170 generations. 



B. DID THE GENETIC ALGORITHM PERFORM AS EXPECTED? 

As theoretically predicted, DaMI performs very well, in terms of speed, hypothesis 
quality improvement, and search space coverage. This question focuses solely on the ability of 
DaMI to perform an efficient, self-improving search and not on the value of results to medical 
professionals (which will be discussed in the next section). The tremendous size of the search 
space has been mentioned earlier, but the number of possible combinations should be presented 
specifically at this point: 

• Exposure-to-diagnosis Runs. 29 Boolean reported exposures, gender (2 possible 
values), service (6 values), race (8 values), and 21 Boolean diagnoses. 

Possible combinations = 2^29 x 2 x 6 x 7 x 2^21 = 9.46 x 10^16 

• Exposure-to-symptom Runs. 29 Boolean reported exposures, gender (2 possible 
values), service (6 values), race (8 values), and 15 Boolean symptoms. 

Possible combinations = 2^29 x 2 x 6 x 7 x 2^15 = 1.48 x 10^15 

It is clear that these two types of runs present a credible challenge to any genetic algorithm. 
They are both computationally explosive (because of search space size) and highly unstructured 
(because of the high number of LHS and especially RHS attributes), yet DaMI has processed 
them with striking success. 
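
The two search-space sizes can be reproduced with the short calculation below. The function name is illustrative. Two inputs are assumptions about the intended counts: the race factor of 7 is taken from the printed formula (the list above gives 8 values), and the symptom runs are taken to use the 15 standard symptom fields mentioned in section V.D.2.

```python
from typing import Sequence


def search_space(boolean_lhs: int, categorical_lhs: Sequence[int], boolean_rhs: int) -> int:
    """Number of value combinations across all LHS and RHS fields."""
    size = 2 ** boolean_lhs            # each Boolean exposure is present or absent
    for n_values in categorical_lhs:   # gender, service, race
        size *= n_values
    return size * 2 ** boolean_rhs     # each Boolean outcome is present or absent


diagnosis_space = search_space(29, [2, 6, 7], 21)
symptom_space = search_space(29, [2, 6, 7], 15)

print(f"exposure-to-diagnosis: {diagnosis_space:.2e}")
print(f"exposure-to-symptom:   {symptom_space:.2e}")
```

The six extra Boolean outcome fields make the diagnosis space 64 times larger than the symptom space.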



1. Analysis Speed 

DaMI’s search efficiency allows it to perform analyses, which normally take years, in a 
matter of hours. Analysis speed is the time required for a genetic algorithm to comprehensively 
search the given space. Comprehensive search will be dealt with shortly, but at the moment, we 
will focus on the time required for DaMI to complete an analysis. If that time is significantly 
less than would be possible using a “brute force” examination of the same database, then the first 
advantage has been achieved. As mentioned in section II, it was observed that a personal 
computer can test about 600,000 possible combinations per day. If that is the case, then the 
exposure-to-diagnosis run should take about 432 million years; this is clearly not acceptable. 

Since DaMI never searches a space the same way twice, analysis times for the same problem 
vary; however, DaMI performs the same analysis in 36 hours (on average). Exposure-to- 
symptom runs take about 44 hours, using the genetic algorithm. Although the exposure-to- 
symptom runs involve a smaller search space, DaMI requires more generations to converge on an 
answer. Analysis times do increase in relation to the number of possible combinations; however, 
the character of the research question also affects the time required for DaMI to converge on an 
answer. Analysis times of similar runs are fairly consistent (less than 10% deviation). A profile 
of the three DaMI exposure-to-diagnosis runs is illustrated in Figure #16. 

Figure 16. Analysis Speed Profile of Exposure-to-diagnosis Runs 

Notice that the processing speed increases as a small group of rules begins to dominate the 
population (convergence). It must be reiterated that DaMI uses the same platform as was used 
for “brute force” testing; it is the selectivity of search (knowing what alternatives need not be 
tested) that gives this methodology its incredible advantage. 



2. Hypothesis Quality Improvement 

DaMI is consistently able to adaptively improve the quality of the hypotheses it 
generates as the analysis progresses. A genetic algorithm is theoretically an intelligent, adaptive 
search technique. This means that as processing time passes, the system will generate 
hypotheses of increasing quality based on the results of analyses already conducted. In the case 
of DaMI, quality is indicated by the fitness measure of a hypothesis. The cumulative 
fitness of a generation represents the aggregate quality of all the hypotheses synthesized during 
that generation. Although some new individuals in each generation may receive very low fitness 
measures, if the cumulative fitness increases in successive generations, then the quality of 
hypotheses as a whole is improving. DaMI demonstrates the characteristic ability of genetic 
algorithms to rapidly increase the quality of new hypotheses generated. DaMI rapidly improves 
cumulative fitness until a small group of rules begins to dominate the population [premature 
convergence (Koza, 1989)], but (largely because of same-parent crossover randomization) it then 
boosts mutation probability and continues to break through to higher cumulative fitness plateaus. 
A profile of improving hypothesis quality for exposure-to-diagnosis runs is presented in Figure 
#17. Note that in each of the three runs, the cumulative fitness curve levels off (signaling premature 
convergence) and then continues to sporadically increase. 



[Plot of cumulative fitness by generation] 

Figure 17. Analysis Speed Profile of Exposure to Diagnosis Runs 



3. Reproducibility: Search Space Coverage 

While a genetic algorithm may complete a search quickly, the speed advantage is of 
limited value without some indication that the results derived are actually the best in the search 
space. DaMI produces consistent reproducibility on the extremely large spaces it searches, 
attesting to its strong ability to search a large space by testing a small subset of possible 
combinations. As discussed in section V.D.3, proving that a genetic algorithm has completely 
examined a space is a paradoxical question: you cannot prove that the genetic algorithm made 
the right decision without testing every possible hypothesis. Reproducibility gives a strong 
indication that the alternative space has been searched effectively. Ideally, multiple 
independent runs of the genetic algorithm (see section V.D.3) would test only a few 
of the same rules of low fitness but converge on the same rules of high fitness. A low 
intersection of low fitness rules between runs indicates that each approached convergence from 
different areas of the search space (i.e. they did not all follow the same path). A high intersection 
of high fitness rules suggests that, despite entering the search space from different directions, 
each independent run has arrived at the same answer. This reproducibility strongly suggests that 
the entire search space has been effectively, but not physically, examined. 

DaMI achieves high reproducibility in spite of the rapid search time and tremendous 
space. In the exposure-to-diagnosis study, all three runs agree on the same 16 highest fitness 
hypotheses. Lower fitness hypotheses show steadily decreasing levels of intersection, as is 
theoretically predicted. This is particularly exciting, because each production run has achieved 
consensus by testing only 7,100 - 7,400 of the 1,041,000 possible attribute combinations. The 
probability of three independent runs randomly agreeing on the same sixteen hypotheses 
(especially since each run is testing only 0.7% of all possible attribute combinations) is 
infinitesimally small. The natural question is, “Did the three runs, by some streak of luck, enter 
the search space from the same starting point?” This is not the case, because the three runs only 
tested 14.1% of the same lower fitness rules, proving that they have entered the space from 
different points but converged on the same answer. Note in Figure #18 that the percentage of 
rule intersection (Runs 20, 21, and 22 are the three runs conducted in the exposure-to-diagnosis 
study) between runs approaches 100% for rules with a fitness measure higher than 8.0. This 
intersection decreases steadily as the fitness measure decreases (going left on the graph). In the 
case of exposure-to-symptoms, the reproducibility is not as high, but still quite striking. In this 
study, each run tested between 8,000 and 10,000 hypotheses. The three runs agree on 5 of the 6 
highest fitness hypotheses. This is represented in Figure #19 by an intersection percentage of 
80% on hypotheses with a fitness of over 5.31 (Runs 23, 24, and 25 are the three runs conducted 
in the exposure-to-symptom study). Notice that, as in the exposure-to-diagnosis study, the 
intersection between runs decreases as the fitness measure decreases, culminating with an 
intersection of only 20% for rules with fitness measures between 1.0 and 3.0. 




Figure 18. Exposure-to-diagnosis Reproducibility 



[Plot of exposure-to-symptom reproducibility by fitness measure] 

Figure 19. Exposure-to-symptom Reproducibility 

Based on the high reproducibility of DaMI production runs, there is a strong indication 
that the search space has been effectively searched for the given fitness measure and search 
parameters. This is particularly significant in the case of Desert Storm research. Recall that the 
existence of any syndrome has not yet been determined. Therefore, if DaMI fails to find a viable 
syndrome profile but can show that the space has been searched effectively, that information will 
be of extremely high value to CCEP research. Additionally, any comprehensive list of 
correlations between risk factors and medical outcomes will be of value to PGW participants and 
the medical practitioners providing their ongoing medical care. 

C. WHAT DID DaMI FIND? 

DaMI has proven, by the standards of genetic algorithm theory, that it has studied the 
CCEP database quickly, intelligently, and comprehensively. All of the theory and development 
strategies now come down to one question, “What did we learn?” Computational results so far 
suggest that our system has succeeded at the given tasks, requiring relatively few resources. 
Experiments reveal no single syndrome, but numerous correlations do exist that require 
additional clinical analysis. 

Based on DaMI research, there is no indication that a single syndrome or other medical 
entity is causing wide-spread adverse health ramifications among a significant cross-section of 
PGW participants in the CCEP program. By “significant,” we mean that no group of over 100 
participants, sharing common reported exposure/demographic information, exhibits a unique set 
of reported symptoms and/or outcome diagnoses. Keep in mind that only the 21 most frequently 
reported diagnoses (and combinations of these) have been tested to date. This does not mean that 
a syndrome cannot exist, but the data collected by CCEP and specifically studied by this research 
do not indicate such a correlation. 

There are, however, numerous correlations of exposure/demographic information and 
associated symptoms/diagnoses which suggest that smaller groups may share common health 
conditions based on shared exposure to common health risk factors. These associations are based 
solely on statistical correlation; therefore, a final determination is withheld pending review of the 
information by medical professionals. In any case, the examined data suggest a need for further 
research. 

The number of correlations found by DaMI is quite large; we have resisted summarizing 
hypotheses to preserve the robustness of the information. Therefore, the challenge of filtering 
and reporting awaits the input of CCEP researchers. Each exposure-to-diagnosis run has 
produced around 4,500 hypotheses, and each exposure-to-symptom run has produced about 6,100 
hypotheses. In each case, the three sets of rules are combined into a single hypothesis set (with 
duplicates removed). The information has been further refined, subject to the following criteria: 

• Hypotheses applying to fewer than five individuals in the sample set have been 
removed to prevent undue influence by single outliers. By definition, a syndrome is 
a medical condition shared by a number of individuals. 

• Hypotheses are derived from a randomly selected 45% sample (without replacement) 
subset of the entire CCEP database. These hypotheses are tested against a separate 
45% (independent) partition of the CCEP database. Hypotheses whose fitness 
measure in the second (verification) sample differed from the fitness measure from 
the original sample by more than 20% have been eliminated. Fitness measures 
which remain constant over both the original and verification sample are called 
duplicable, suggesting they hold true for the entire database and are not a statistical 
anomaly. 
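
The two refinement criteria above can be expressed as a simple filter. This is a sketch under assumptions: the record layout and names are hypothetical, and the 20% difference is measured relative to the fitness in the original sample, which is one reading of the criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    rule: str
    support: int                  # individuals the rule applies to in the sample
    fitness_original: float       # fitness in the discovery sample
    fitness_verification: float   # fitness in the independent verification sample


def is_duplicable(h: Hypothesis, tolerance: float = 0.20) -> bool:
    """True if verification fitness is within tolerance of the original fitness."""
    if h.fitness_original == 0:
        return h.fitness_verification == 0
    change = abs(h.fitness_verification - h.fitness_original) / abs(h.fitness_original)
    return change <= tolerance


def refine(hypotheses: List[Hypothesis], min_support: int = 5) -> List[Hypothesis]:
    """Drop rules covering fewer than min_support individuals or failing duplicability."""
    return [h for h in hypotheses if h.support >= min_support and is_duplicable(h)]


# Made-up candidates for illustration.
candidates = [
    Hypothesis("A", support=120, fitness_original=8.0, fitness_verification=7.1),
    Hypothesis("B", support=3, fitness_original=9.5, fitness_verification=9.4),
    Hypothesis("C", support=40, fitness_original=5.0, fitness_verification=3.5),
]
kept = refine(candidates)
```

Here rule B is dropped for covering too few individuals and rule C for a 30% change in fitness between samples.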

The application of the aforementioned selection criteria has resulted in a set of 2,653 candidate 
hypotheses concerning exposure-to-diagnoses and 4,959 hypotheses concerning exposure-to-symptoms. 
No minimum fitness measure threshold has been applied because the modified j-measure 
is an arbitrary score, suitable for ranking the order of interest of competing hypotheses. 
The fitness measure may not be attached to a specific interest “level.” Obviously, a great number 
of the hypotheses having low fitness measures do not contain correlations strong enough to 
support strong research attention. For this reason and for the sake of brevity, only the 100 
highest fitness hypotheses of each study are included in Appendix C and discussed in the next 
two result summary sections. 

These two sections will discuss the highlights and some specific hypotheses from both 
the exposure-to-diagnosis and exposure-to-symptom studies. The exposure-to-diagnosis and 
exposure-to-symptom results are each exciting for different reasons. The exposure-to-diagnosis 
study contains many high confidence correlations (hypotheses which are applicable to over 50% 
of the participants concerned). The exposure-to-diagnosis hypotheses contain few unexpected 
correlations, but clearly demonstrate the ability of DaMI to cull out extremely strong correlations 
from a “mountain” of data. The exposure-to-symptom results contain many unexpected 
hypotheses, but with somewhat lower correlation strength. The exposure-to-symptom results 
attest to the sensitivity of DaMI analysis and contain new (previously undiscovered) information 
which should attract expanded clinical research. 



1. Exposure-to-diagnosis Correlations 

The exposure-to-diagnosis study yields a large number of strong correlations (positive 
predictive values between exposure and diagnosis of over 50%) and provides corroboration of 
some intuitive aspects of medical relationships. Several new relationships have been identified, 
but few hold information that is unexpected by the non-medical analyst, at least when studied 
separately from associated symptoms. DaMI demonstrates a powerful ability to cull strong 
correlations from a large body of data, and in that respect, the results are very exciting. It must 
be reiterated that only combinations of the 21 most frequently occurring diagnoses have been 
considered at this point. However, a restructuring of the CCEP diagnosis representation which 
groups like diagnoses (with differing ICD codes) may bear even more information. 

No single exposure or group of exposures appear(s) to dominate the resulting hypothesis
set, unlike what will be seen in the exposure-to-symptom study. Several exposures (but no
demographic attributes) appeared in many of the 100 highest fitness hypotheses. 19% of the
hypotheses included participants who were wounded and another 19% included participants who
saw casualties. Yet another 19% of hypotheses included participants who reported exposure to
“other paints” and 12% reported exposures to nerve gas. At first, the fact that many hypotheses
include wounded participants appears interesting because only 1% of participants in the CCEP
database have been wounded. Also, only 4% of CCEP participants report exposure to nerve gas,
so that too seems to be highly represented in the hypotheses. Casualties and other paints in
hypotheses are less surprising since both have been highly reported by CCEP participants (50%
and 38% respectively). However, 37% of the hypotheses discovered include Post-traumatic
Stress Disorder and 22% include Depression (CCEP, 1996, p. 19). This high prevalence of
psychosocial diagnoses in the hypothesis set decreases the surprise that many hypotheses
concern wounded participants (as the two are commonly associated). Surprisingly, Severe Sleep
Apnea is included in 20% of the hypotheses. Sleep Apnea is a medical condition not commonly
linked to any CCEP reported exposure. This leaves the prevalence of reported nerve gas
exposures and the diagnosis of Sleep Apnea in hypotheses as the only unexpected attributes,
from a macro perspective. Reported nerve gas exposure is all the more unexpected because
chemical alarms and mustard gas (similar participant concerns) are notably scarce in the
hypotheses. It will be seen later that reported nerve gas exposure plays a significant role in the
exposure-to-symptom study. Finally, it should be noted that oil and smoke, heat and smoke,
Pyridostigmine Bromide (Pb), and headaches are included in few hypotheses, although all are
factors receiving high attention in CCEP research.

An explanation of the DaMI reporting format is included in Figure #20. While the space
is not available to discuss even the 100 highest fitness hypotheses, several illustrative hypotheses
are presented now in Figure #21. Especially in the exposure-to-diagnosis study, DaMI
demonstrates the ability to unmask a high level of association between exposure/demographic and
diagnosis attributes. This association is not limited to high positive predictive value (high
probability of the then condition given the if condition); DaMI is also able to look at the
associations in reverse (high probability of the if condition given the then condition) and examine
the contraindications (the if condition precludes the then condition) between
exposures/demographics and diagnoses. An example of each association type is presented below.
The medical professional is referred to Appendix C for a complete list of hypotheses.

Figure 20. How to Read a DaMI Report

As stated before, the exposure-to-diagnosis examples presented here demonstrate the
capability of DaMI to dig into a “mountain” of data and find strong hypotheses. The examples
presented here are selected to illustrate that capability. It is highly recommended
that the medical professional examine all of the hypotheses (Appendix C) in detail. Figure
#21(a) is a hypothesis of extremely high positive predictive value. The hypothesis states that
94% of participants diagnosed with mechanical lower back pain and major depression served in
the Army. 94% is an extremely high correlation for such a broad hypothesis (a specific diagnosis
combination is linked to a single service). Note that the fitness measure obtained using the
analysis database (complex association factor) is quite close (2.39/2.10) to that of the verification
database (complex association verification), suggesting that the rule holds for all participants (and
is not a statistical anomaly). The hypothesis illustrated in Figure #21(b) is much more specific, but is
still quite strong. This hypothesis states that 77% of the participants diagnosed with
DJD/Osteoarthritis and Severe Sleep Apnea reported eating Non-allied Forces
food and reported exposure to pesticides. DaMI is capable of isolating strong data correlations,
regardless of hypothesis specificity.

Figure 22. Exposure-to-diagnosis Examples



The next two hypotheses are equally interesting, but are much more difficult to find
using conventional search techniques. DaMI, using the Modified J-measure, is able to see
correlations which do not fit the high positive predictive value paradigm. The hypothesis in
Figure #22(a) states that 18% of Marine participants reporting exposure to pesticides and malaria
have been diagnosed with asthma. A positive predictive value of 18% does not jump out at the
analyst and would therefore not figure prominently in a conventional analysis; however, DaMI
notes that only 5.1% of all participants have been diagnosed with Asthma. This means that
Marines reporting pesticide and malaria exposure are 3.5 times more likely to have been
diagnosed with Asthma than the general CCEP participant population. In light of that fact, the
18% positive predictive value of this hypothesis is indeed significant, and DaMI has assigned it a
high fitness measure. The hypothesis in Figure #22(b) is an example of contraindication. Note
that this hypothesis shows no high correlation in either direction. The hypothesis states that 2%
of participants reporting no exposure to Pb and not viewing casualties have been diagnosed with
Post-traumatic Stress Disorder (PTSD). The reader’s attention is directed to the matrix on the
right section of the hypothesis report. In 589 cases where the LHS is true, the RHS is false.
Also, in 424 cases where the RHS is true, the LHS is false. 1,022 participants report information
that in some way involves this hypothesis’ exposures or diagnosis. In 99% of those cases, the
exposures exclude the diagnosis outcome. In plain English, not reporting exposure to Pb or
casualties precludes a diagnosis of PTSD. This fact, although readily apparent to conventional
analysis, is very informative because of its exclusive properties and is therefore flagged by
DaMI.
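
The arithmetic behind these two examples can be reproduced with a short sketch. It is an illustration only: the `RuleCounts` class and `lift` function are names assumed here, not part of DaMI, and the count of cases where both sides hold (9) is derived on the assumption that the 1,022 "involved" participants are the union of the LHS-true and RHS-true groups.

```python
from dataclasses import dataclass


@dataclass
class RuleCounts:
    both: int      # LHS true and RHS true
    lhs_only: int  # LHS true, RHS false
    rhs_only: int  # RHS true, LHS false

    def ppv(self) -> float:
        """P(then | if): positive predictive value."""
        return self.both / (self.both + self.lhs_only)

    def reverse(self) -> float:
        """P(if | then): the same association read in reverse."""
        return self.both / (self.both + self.rhs_only)

    def exclusion(self) -> float:
        """Share of involved cases in which exactly one side holds."""
        involved = self.both + self.lhs_only + self.rhs_only
        return (self.lhs_only + self.rhs_only) / involved


def lift(ppv: float, base_rate: float) -> float:
    """How many times more likely the outcome is under the rule."""
    return ppv / base_rate


# Asthma rule: 18% under the rule versus 5.1% in the whole population.
asthma_lift = lift(0.18, 0.051)

# PTSD contraindication: counts quoted in the text; `both` is derived
# (assumption: 1,022 is the union of LHS-true and RHS-true cases).
ptsd = RuleCounts(both=1022 - 589 - 424, lhs_only=589, rhs_only=424)
```

Under that assumption, `ptsd.ppv()` rounds to the 2% quoted for the hypothesis and `ptsd.exclusion()` to the quoted 99%, while `asthma_lift` is about 3.5.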

The exposure-to-diagnosis study hypotheses exemplify the ability of our genetic 
algorithm to find both strong, obvious correlations and more intricate associations in the CCEP 
database. Many of the hypotheses reinforce “common sense” medical knowledge, but remember 
that DaMI has discovered these hypotheses without the benefit of prior medical knowledge of 
any kind. In light of this success, serious attention should be directed toward those hypotheses 
presented that do not conform to present-day medical perceptions. 



2. Exposure-to-symptom Correlations 

The exposure-to-symptom study is more comprehensive than the diagnosis studies
because the exposure-to-symptom runs consider every reported symptom category, not a top
stratification. Many individual hypotheses contain new (or unexpected) correlations and there
are also several interesting trends revealed about the hypotheses as a group. This previously
undiscovered information is of key interest to medical researchers. The author believes that this
is the reason that exposure-to-symptom runs consistently take longer to converge and are
somewhat less successful at reproducing than exposure-to-diagnosis runs. Even though the
theoretical search space of exposure-to-symptom runs is smaller, the actual search space contains
more represented combinations (because all attributes are included) and is therefore practically
more difficult to solve. This explains the difference in run times for different studies noted
previously.

While the exposure-to-diagnosis runs contain several intuitively obvious correlations, the
exposure-to-symptom runs produce several strong but “unexpected” trends. These unexpected
trends take the form of pervasive exposure and symptom combinations appearing in many of the
highest fitness hypotheses, despite the fact that these combinations are not prevalent in the CCEP
database as a whole. These are the specific “threads” of information that DaMI has been
designed to discover.

Several exposure attributes appear many times in the highest fitness exposure-to-symptom
hypotheses:

• over 50% of the hypotheses include reported exposure to mustard gas (singly or in 
combination) 

• almost 25% include reported exposure to nerve gas 

• 14% include participants that were wounded in combat 

• 12% include participants reporting some form of pre-conflict reproductive 
difficulties. 

This is somewhat unusual because all of these attributes are reported relatively infrequently in the
CCEP database as a whole. Mustard gas exposure has been reported by 2% of CCEP
participants, nerve gas 6%, wounded in combat 2%, and pre-conflict reproductive difficulties
5.5% (CCEP, 1996, p. 19). Finally, the combination of reported nerve gas exposure and
pre-conflict reproductive difficulties occurs in 9% of the top hypotheses. Notably scarce are
hypotheses involving actual combat, chemical alarms, Scud attacks, race, service, or post-conflict
reproductive difficulties. Since pre- and post-conflict reproductive difficulties are so highly
statistically correlated, it is surprising that post-conflict reproductive difficulties do not appear in
any of the top hypotheses.
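
The contrast drawn above, between how often an attribute appears in the top hypotheses and how often it is reported in the database, can be expressed as a simple ratio. The sketch below is illustrative only: the function name is an assumption, and "over 50%" and "almost 25%" are taken as 0.50 and 0.25 for the sake of the arithmetic.

```python
# attribute -> (share of top exposure-to-symptom hypotheses,
#               share of CCEP participants reporting the attribute)
attributes = {
    "mustard gas": (0.50, 0.02),   # "over 50%" taken as 0.50
    "nerve gas": (0.25, 0.06),     # "almost 25%" taken as 0.25
    "wounded in combat": (0.14, 0.02),
    "pre-conflict reproductive difficulties": (0.12, 0.055),
}


def over_representation(hypothesis_share: float, database_share: float) -> float:
    """Ratio of an attribute's share of hypotheses to its database prevalence."""
    return hypothesis_share / database_share


ratios = {
    name: over_representation(h_share, db_share)
    for name, (h_share, db_share) in attributes.items()
}

# Print the attributes from most to least over-represented.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.1f}x")
```

On these figures mustard gas is over-represented by a factor of roughly 25 and wounded in combat by roughly 7, with nerve gas and pre-conflict reproductive difficulties at roughly 4 and 2.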

Similarly, the symptoms bleeding gums and weight loss are each included in over 50%
of the hypotheses, and 44% of the hypotheses involve a combination of both bleeding gums and
weight loss. Only 127 (or 1.6%) of the participants in the CCEP database subset studied (7746
total participants) reported that specific combination of symptoms. It is extremely interesting
that so many hypotheses involve bleeding gums and weight loss, when these two symptoms are
so scarce in the CCEP database at large. Also noteworthy is the large number of hypotheses
relating reported mustard gas exposure to bleeding gums and weight loss (44% of hypotheses)
and nerve gas exposure and pre-conflict reproductive difficulties with bleeding gums (9% of
hypotheses). Notably scarce in the hypothesis set are those including joint pain, headaches,
and fatigue, the symptoms most commonly elicited by physicians (CCEP, 1996, p. 20).

While thesis constraints prohibit discussing all 100 of the highest fitness hypotheses,
several are included to illustrate some of the correlations discovered (Figure #23).




The hypothesis in Figure #23(a) is included to demonstrate that DaMI, without the aid
of medical knowledge, will discover intuitively obvious (to a medical researcher) correlations.
This hypothesis states that 70% of Navy participants who report exposure to diesel fuel and
mustard gas also complain of difficulty breathing. It is understandable that anyone perceiving an
exposure to mustard gas and who works with diesel fuel may, at some time, have suffered from
difficulty breathing.

In Figure #23(b), it is noted that 21% of participants reporting exposure to nerve gas and
pre-conflict reproductive difficulties complain of both bleeding gums and muscle pain. Note that
the fitness measure (2.85) in the analysis database is very close to that of the verification
database (2.43), indicating that the hypothesis holds across different independent samples of the
entire CCEP database. This hypothesis can be considered unexpected because this specific
exposure combination is reported by only 0.5% of the participants and the symptomatology by
only 3.9%.
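
The fitness values quoted here come from the thesis's Modified J-measure, whose exact form is not restated in this section. As a rough illustration of why a 21% rule over a 3.9% base rate scores well, the sketch below computes the standard, unmodified J-measure of Smyth and Goodman for a rule "if Y then X" from the three percentages above. The function and parameter names are assumptions, and the result is not on the same scale as the reported 2.85.

```python
import math


def j_measure(p_y: float, p_x: float, p_x_given_y: float) -> float:
    """Standard (unmodified) J-measure, in bits, for "if Y=y then X=x".

    p_y          probability that the if-condition holds
    p_x          prior probability of the then-condition
    p_x_given_y  probability of the then-condition given the if-condition
    """
    def term(posterior: float, prior: float) -> float:
        # By convention, 0 * log(0 / q) is taken as 0.
        if posterior == 0.0:
            return 0.0
        return posterior * math.log2(posterior / prior)

    # Cross-entropy between the posterior and prior of the then-condition.
    cross_entropy = term(p_x_given_y, p_x) + term(1.0 - p_x_given_y, 1.0 - p_x)
    # Weight by how often the rule applies at all.
    return p_y * cross_entropy


# Percentages quoted for the nerve gas / reproductive difficulties rule.
score = j_measure(p_y=0.005, p_x=0.039, p_x_given_y=0.21)

# If the rule did not change the outcome probability, the score would be zero.
baseline = j_measure(p_y=0.005, p_x=0.039, p_x_given_y=0.039)
```

The measure rewards rules whose if-condition shifts the outcome probability far from its prior, scaled by how often the if-condition occurs. This is one way a rare exposure combination with a modest 21% predictive value can still rank highly.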

In Figure #23(c), it is noted that 9% of participants reporting exposure to nerve gas and
mustard gas complain of both bleeding gums and weight loss. As before, the fitness measures
(2.77/2.41) of both the analysis and verification database are quite close. Also note that this
hypothesis holds in both directions; 6% of participants reporting bleeding gums and weight loss
reported exposure to nerve gas and mustard gas. This hypothesis is also considered unexpected
because this specific exposure combination is reported by only 1% of the participants and the
symptomatology by only 1.6%.
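
The comparison of analysis and verification fitness made for these examples can be sketched as a simple ratio check. The fitness pairs are those quoted in the text; the function names and the 0.8 cutoff are hypothetical choices for illustration, not values used by DaMI.

```python
# label -> (analysis-database fitness, verification-database fitness)
fitness_pairs = {
    "Figure 21(a)": (2.39, 2.10),
    "Figure 23(b)": (2.85, 2.43),
    "Figure 23(c)": (2.77, 2.41),
}


def retained_fraction(analysis: float, verification: float) -> float:
    """Fraction of the analysis fitness that survives on the verification data."""
    return verification / analysis


def holds_up(analysis: float, verification: float, min_retained: float = 0.8) -> bool:
    """True if the rule keeps at least `min_retained` of its fitness.

    The 0.8 default is a hypothetical cutoff chosen for this example.
    """
    return retained_fraction(analysis, verification) >= min_retained


retained = {label: retained_fraction(a, v) for label, (a, v) in fitness_pairs.items()}
all_hold = all(holds_up(a, v) for a, v in fitness_pairs.values())
```

All three quoted rules keep between roughly 85% and 88% of their fitness on the verification database, which is the sense in which the two scores are described as close.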

In summation, the exposure-to-symptom study brings to light several correlations which
warrant further clinical analysis. Interest lies not only in the hypotheses themselves, but also in
the high number of correlations involving rare combinations of exposures and symptoms.

D. ARE THE RESULTS USEFUL TO MEDICAL 
PROFESSIONALS? 

The results of both the Exposure-to-diagnosis and Exposure-to-symptom studies and
research methodology have been reviewed by Ph.D. Epidemiologists on the CCEP staff and the
Director of the Deployment Surveillance Team. CCEP Epidemiologists feel that DaMI has great
potential for “identifying previously unrecognized patterns of symptoms and diagnoses” (CCEP,
Sep 1996). They also agree that DaMI has already identified many associations in the CCEP
database that have not been found by conventional methods. However, they strongly emphasize
that DaMI result hypotheses must be subjected to more detailed, epidemiologically based
post-processing before they can be of practical use to the CCEP research effort. They recommend that
future DaMI research efforts be more closely coordinated with CCEP epidemiologists. The
bottom line is that the substantial potential of DaMI as a research tool has been recognized by the
medical researchers and the research sponsor has directed that DaMI be included actively in the
study of Desert Storm Syndrome with the closer involvement of CCEP epidemiologists.

VII. CONCLUSION



After many months of theoretical development, genetic algorithm design, and fine
tuning, DaMI has accomplished its goal: to comprehensively search the CCEP Desert Storm
database and provide medical researchers with a subset of several thousand hypotheses for further
investigation from the billions of possible combinations. DaMI has proven its ability to search
an extremely large unstructured database and cull, in a reasonable amount of time, a subset of the
highest interest rules within that database. DaMI has more to tell us about the CCEP database, as
it can be retuned for different search priorities and measures of interest. It may also be applied to
any number of similar bodies of medical and non-medical data.

This research began with a formidable analysis problem and an idea that the usefulness
of computer analysis could extend beyond the conventional paradigm of “number crunching.”

The author believed that by imparting a genetic algorithm with a model of a human researcher’s
interest, the genetic algorithm could intelligently attack a tremendous search problem and
reduce it to a manageable size, given limited resources. We have taken a complex research
question and an unstructured database and formulated both into a workable representation of
researcher interest and a usable source of study. A genetic algorithm (DaMI) has been created
which can perform a self-adapting, intelligent search with striking results. In short, DaMI has
achieved our vision and exceeded our wildest expectations. This thesis has shown only one
venture into this new realm of medical research, the pre-emptive employment of genetic algorithm
analysis; there are certainly many more adventures awaiting.



A. LESSONS LEARNED 



The author encountered few problems during this thesis process. This thesis involves a
very high visibility and politically sensitive subject, Desert Storm Syndrome. As such, there
were numerous requirements for presentations and progress meetings in addition to the normal
research challenges. Since the political obligations were linked to the feedback from the
sponsoring agency, they could not be ignored; this placed a very high time demand on the author.
Also, the sponsoring agency is located in Washington, D.C., so a great deal of travel and remote
communication was required to ensure adequate project coordination. Finally, feedback from
medical researchers in the field was very difficult to obtain because of their diverse geographic
locations and limited availability.

The author has learned several valuable lessons from the thesis process:

• When doing a thesis involving data analysis, do not wait for results to start writing the thesis. 
A great deal of the thesis itself describes the theoretical basis and methodology of the 
research, and therefore, can be written before final results are achieved. The pressure of 
“doing the write-up” is a serious burden to good analysis and writing early helps to alleviate 
that pressure.

• If the thesis is directly funded by an outside agency (in my case the CCEP), it is important to
clearly identify a liaison at that agency. In my case, there was not a clear procedure for
information exchange established during the first half of the project, which made
coordination haphazard. Once a clear coordination mechanism was put in place, the thesis
process became much smoother.

• It is critical that a researcher have a sounding board who is not directly attached to the
research. It was very easy for me to become so engrossed in the problem that I began
missing glaring solutions. I was lucky to have a single individual (not a genetic algorithm or
medical expert per se) who reality-checked my research and reviewed my thesis throughout
my research. This feedback has proven invaluable to the quality of my thesis and the success
of my research.

B. RECOMMENDATIONS FOR FUTURE RESEARCH 



The success of DaMI opens the door to countless opportunities for future research. Two
areas of study remain to be explored in the CCEP database:

• Analysis of demographic/exposure and a restructured diagnosis set. Efforts are
currently underway to regroup participant diagnosis information so that similar
diagnoses (even those with vastly divergent ICD codes) are grouped together. This
will allow DaMI to analyze a majority of diagnoses, as opposed to the top 21
diagnoses as presented in this thesis.

• Analysis of time/motion study of units and their locations during the Persian Gulf
Conflict. Since in many cases units are homogenous in location and therefore
exposure to health risks, an analysis of the CCEP participants’ unit location in time
and associated symptoms and/or diagnoses should prove quite fruitful.

It should be obvious that DaMI has not been created with the sole intent of searching for
a Desert Storm Syndrome. It is applicable to many other large, unstructured databases of
medical and non-medical data. Aside from examining other bodies of data, there are several
areas to investigate concerning DaMI itself:

• Comparison of DaMI performance with other commercial data mining software and
other data mining techniques (like regression analysis, cluster analysis, and neural
networks).

• Modification of DaMI’s statistical package to use alternative fitness functions, such
as Chi-square instead of just the Modified J-measure.

• Enhancement of the DaMI genetic algorithm to utilize parallel-processing for
statistical computations. Clearly using a single PC is less efficient than a group of
PC nodes operating simultaneously. This will dramatically increase search speed
without increasing the complexity of computer hardware required.

• Rewriting of the DaMI code into C++ or Ada, so that it can run on a higher capacity
computer platform. Of course, this will increase efficiency, but will make the
algorithm more restrictive (less portable) in terms of operating platforms.
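
One of the recommendations above is to try Chi-square as an alternative fitness function. A minimal sketch of what such a function might compute for a single if/then rule is given below, using the textbook Pearson statistic for a 2x2 table without continuity correction. The function name and argument layout are assumptions, and this is not part of the existing DaMI statistical package.

```python
def chi_square_2x2(both: int, lhs_only: int, rhs_only: int, neither: int) -> float:
    """Pearson chi-square statistic for a 2x2 if/then contingency table.

    both      cases where the if- and then-conditions both hold
    lhs_only  cases where only the if-condition holds
    rhs_only  cases where only the then-condition holds
    neither   cases where neither condition holds
    """
    n = both + lhs_only + rhs_only + neither
    row_if = both + lhs_only
    row_not_if = rhs_only + neither
    col_then = both + rhs_only
    col_not_then = lhs_only + neither
    denominator = row_if * row_not_if * col_then * col_not_then
    if denominator == 0:
        # A row or column is empty, so no association can be measured.
        return 0.0
    return n * (both * neither - lhs_only * rhs_only) ** 2 / denominator


# Hypothetical counts: no association, then perfect association.
independent = chi_square_2x2(10, 10, 10, 10)
perfect = chi_square_2x2(20, 0, 0, 20)
```

Unlike positive predictive value, this statistic is symmetric in the two conditions and grows with sample size, so hypotheses scored this way would likely be ranked differently from those scored by the Modified J-measure.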

APPENDIX A. CCEP DATA DICTIONARIES AND DATA
COLLECTION METHODOLOGY



DATA DICTIONARY OF CCEP DATABASE 
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Def. Updatable: 


Yes 












Date Created: 


10/5/95 3:21:36 PM 










Last Updated: 


10/5/95 3:35:06 PM 










Record Count: 


15467 








ID 


Name 


Data Type Length Usable 


Problem Action 


1 


PART_LIMAME 


Text 


20 


no 


privacy act 


Delete 


2 


PART_FNAME 


Text 


15 


no 


privacy act 


Delete 


3 


PART_MNAME 


Text 


10 


no 


privacy act 


Delete 


4 


PART_SSN 


Text 


11 


no 


privacy act 


Delete 


5 


PAY_GRADE 


Text 


4 


demographic 






6 


SERVICE 


Text 


1 


demographic 






7 


REGION 


Text 


2 


unk 






8 


DMIS 


Text 


4 


unk 






9 


PART_BDAY 


Date/Time 


8 


demographic 






10 


PART_FMP 


Text 


2 


demographic 


change # to discrete 




11 


SPON_SSN 


Text 


11 


no 


privacy act 


Delete 


12 


SMOKE_NOW 


Text 


1 


attribute 


has U's 




13 


NM_CG_NOW 


Text 


3 


attribute ? 






14 


SMOKE_PAST 


Text 


1 


attribute 


has U's 




15 


NM_CG_PAST 


Text 


3 


attribute ? 






16 


OIL_SMOKE 


Text 


1 


attribute 


has U's 




17 


HEAT_SMOKE 


Text 


1 


attribute 


has U's 




18 


PASS_SMOKE 


Text 


1 


attribute 


has U's 




19 


DIESL_FUEL 


Text 


1 


attribute 


has U's 




20 


CARC_PAINT 


Text 


1 


attribute 


has U's 




21 


OTHR_PAINT 


Text 


1 


attribute 


has U's 




22 


OTHR_SOLVE 


Text 


1 


attribute 


has U's 




23 


URANIUM 


Text 


1 


attribute 


has U's 




24 


MICROWAVES 


Text 


1 


attribute 


has U's 




25 


PESTICIDES 


Text 


1 


attribute 


has U's 




26 


NERVE_GAS 


Text 


1 


attribute 


has U's 




27 


PYRIDOSTIG 


Text 


1 


attribute 


has U's 




28 


MUSTRD_GAS 


Text 


1 


attribute 


has U's 




29 


CONTM_FOOD 


Text 


1 


attribute 


has U's 




30 


CONTM_WATR 


Text 


1 


attribute 


has U's 




31 


NONAF_WATR 


Text 


1 


attribute 


has U’s 




32 


NONAF_FOOD 


Text 


1 


attribute 


has U's 




33 


ANTHRAX 


Text 


1 


attribute 


has U’s 




34 


BOTULISM 


Text 


1 


attribute 


has U’s 




35 


MALARIA 


Text 


1 


attribute 


has U’s 




36 


OTHER_EXP1 


Text 


35 


attribute 


has U's 




37 


OTHER_EXP2 


Text 


35 


attribute 


has U's 




38 


OTHER_EXP3 


Text 


35 


attribute 


has U's 




39 


ACT_COMBAT 


Text 


1 


attribute 


has U’s 
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4C 


WOUNDED 


Text 


1 


attribute 


has U's 




41 


CASUALTIES 


Text 


1 


attribute 


has U's 




42 


SCUD_ATTAC 


Text 


1 


attribute 


has U's 




43 


CHEM_ALARM 


Text 


1 


attribute 


has U's 




44 


PQ_CHD_P 


Number (Dou 


8 


attribute 






45 


PQ_CHD_A 


Number (Dou 


8 


attribute 






46 


PQ_INF_P 


Text 


1 


attribute 


combine into single field 




47 


PQ_INF_A 


Text 


1 


attribute 


ft 




48 


PQ_MIS_P 


Number (Dou 


8 


attribute 


ft 




49 


PQ_MIS_A 


Number (Dou 


8 


attribute 


ft 




50 


PQ_SB_P 


Number (Dou 


8 


attribute 


If 




51 


PQ_SB_A 


Number (Dou 


8 


attribute 


ft 




52 


PQ_ID_P 


Number (Dou 


8 


attribute 


If 




53 


PQ_ID_A 


Number (Dou 


8 


attribute 


ft 




54 


PQ_DEF_P 


Number (Dou 


8 


attribute 


ft 




55 


PQ_DEF_A 


Number (Dou 


8 


attribute 


combine into single field 




56 


SPON_LNAME 


Text 


20 


no 


privacy act 


delete 


57 


SPON_FNAME 


Text 


11 


no 


privacy act 


delete 


58 


SPON_MNAME 


Text 


11 


no 


privacy act 


delete 


59 


SEX 


Text 


1 


demographic 


blanks 




60 


RACE 


Text 


1 


demographic 


blanks 




61 


MAR_STATUS 


Text 


1 


demographic 


blanks 




62 


DUTY_STAT 


Text 


6 


attribute 


don't know code 




63 


MOS_NEC_AF 


Text 


7 


attribute 


blanks (not too many) 




64 


LOST_WORK 


Number (Dou 


8 


maybe 


question info value 


LOFR 


65 


CHIEF_COMP 


Text 


35 


no 


text 


delete 


66 


CHIEF_DTE 


Date/Time 


8 


attribute ? 


Iquestion info value 


LOFR 


67 


CHIEF_DURA 


Number (Dou 


8 


no 


different for diff diags 


delete 


68 


FATIG_DTE 


Date/Time 


8 


maybe 


question info value 


[Tofr 


69 


FATIG_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


70 


ABDOM_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


71 


ABDOM_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


72 


BLEED_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


73 


BLEED_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


74 


DEPRE_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


75 


DEPRE_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


76 


DIARR_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


77 


DIARR_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


78 


DIFFLDTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


79 


DIFFI_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


80 


SHORT_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


81 


SHORT_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


82 


HAIRL_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


83 


HAIRL_DURA 


Number (Dou | 


8 


attribute 


number confuses algo 


yes/no 


84 


HEADA_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


85 


HEADA_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


86 , 


JOINT_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


87 » 


JOINT_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


88 


MEMOR_DTE 


Date/Time i 


8 1 maybe i 


question info value 


LOFR 
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89 


MEMOR_DURA 


Number (Dou 


8 


attribute 


number cx>nKises algo 


yes/no 


90 


MUSCL_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


91 


MUSCL_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


92 


RASH_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


93 


RASH_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


94 


SLEEP_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


95 


SLEEP_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


96 


WEIGH_DTE 


Date/Time 


8 


maybe 


question info value 


LOFR 


97 


WEIGH_DURA 


Number (Dou 


8 


attribute 


number confuses algo 


yes/no 


98 


OTHR1_COMP 


Text 


20 


no 


cani correlate text 


delete 


99 


OTHR1_DTE 


Date/Time 


8 


no 


cani correlate text 


delete 


100 


OTHR1_DURA 


Number (Dou 


8 


no 


cani correlate text 


delete 


101 


OTHR2_COMP 


Text 


20 


no 


cani correlate text 


delete 


102 


OTHR2_DTE 


Date/Time 


8 


no 


cani correlate text 


delete 


103 


OTHR2_DURA 


Number (Dou 


8 


no 


cani correlate text 


delete 


104 


OTHR3_COMP 


Text 


20 


no 


cani correlate text 


delete 


105 


OTHR3_DTE 


Date/Time 


8 


no 


cani correlate text 


delete 


106 


OTHR3_DURA 


Number (Dou 


8 


no 


cani correlate text 


delete 


107 


OTHR4_COMP 


Text 


20 


no 


cani correlate text 


delete 


108 


OTHR4_DTE 


Date/Time 


8 


no 


cani correlate text 


delete 


109 


OTHR4_DURA 


Number (Dou 


8 


no 


cani correlate text 


delete 


110 


PRLDIAG 


Text 


40 


no 


text 


delete 


111 


PRIJCD 


Text 


6 


RHS 






112 


SEC_DIAG1 


Text 


40 


no 


text 


delete 


113 


SEC_ICD1 


Text 


6 


RHS 


blanks 




114 


SEC_DIAG2 


Text 


40 


no 


text 


delete 


115 


SEC_ICD2 


Text 


6 


RHS 


blanks 




116 


SEC_DIAG3 


Text 


40 


no 


text 


delete 


117 


SEC_ICD3 


Text 


6 


RHS 


blanks 




118 


SEC_DIAG4 


Text 


40 


no 


text 


delete 


119 


SECJCD4 


Text 


6 


RHS 


blanks 




120 


SEC_DIAG5 


Text 


40 


no 


text 


delete 


121 


SEC_ICD5 


Text 


6 


RHS 


blanks 




122 


SEC_DIAG6 


Text 


40 


no 


text 


delete 


123 


SECJCD6 


Text 


6 


RHS 


blanks 




124 


ALLER_CONS 


Text 


1 


no 


question info value 


delete 


125 


AUDIO_CONS 


Text 


1 


no 


question info value 


delete 


128 


CARDLCONS 


Text 


1 


no 


question info value 


delete 


127 


DENTL_CONS 


Text 


1 


no 


question info value 


delete 


128 


DERMA_CONS 


Text 


1 


no 


question info value 


delete 


128 


EARNT_CONS 


Text 


1 


no 


question info value 


delete 


130 


ENDOC_CONS 


Text 


1 


no 


question info value 


delete 


131 


GASTR_CONS 


Text 


1 


no 


question info value 


delete 


132  HEMAT_CONS  Text       1   no           question info value  delete
133  INFEC_CONS  Text       1   no           question info value  delete
134  NEPHR_CONS  Text       1   no           question info value  delete
135  NEURO_CONS  Text       1   no           question info value  delete
136  OCCUP_CONS  Text       1   no           question info value  delete
137  PULMO_CONS  Text       1   no           question info value  delete
138  PSYCH_CONS  Text       1   no           question info value  delete
139  PTEST_CONS  Text       1   no           question info value  delete
140  RHEUM_CONS  Text       1   no           question info value  delete
141  MOVE_ON     Text       1   no           question info value  delete
142  DIAG_DTE    Date/Time  8   no           question info value  delete
143  DIAG_DONE   Text       1   no           question info value  delete
144  PTQS_DONE   Text       1   no           question info value  delete
145  PRQS_DONE   Text       1   no           question info value  delete
146  IREL_DONE   Text       1   no           question info value  delete
147  DECL_DONE   Text       1   no           question info value  delete
148  HOME_ADDR1  Text       30  no           privacy act          delete
149  HOME_ADDR2  Text       30  no           privacy act          delete
150  HOME_TOWN   Text       20  no           privacy act          delete
151  HOME_STATE  Text       2   demographic
152  HOME_ZIP    Text       5   no           info too specific    delete
153  WORK_PHONE  Text       12  no           privacy act          delete
154  HOME_PHONE  Text       12  no           privacy act          delete
155  DCFORM_DTE  Date/Time  8   no           no info value        delete
156  STARTLATER  Text       1   no           no info value        delete
157  WHENTOCALL  Text       15  no           no info value        delete
158  DECLINE     Text       1   no           no info value        delete
159  WITHDRAW    Text       1   no           no info value        delete
160  EVAL_COMP   Text       1   no           no info value        delete
161  SATISFIED   Text       1   attribute ?  question info value
162  PQ_DATE     Date/Time  8   no           no info value        delete
163  PQ_EVALDTE  Date/Time  8   no           no info value        delete
164  MIL_ADDR1   Text       30  no           no info value        delete
165  MIL_ADDR2   Text       30  no           no info value        delete
166  MIL_STATE   Text       2   no           no info value        delete
167  MIL_ZIP     Text       5   no           no info value        delete
168  CHECKL_DTE  Date/Time  8   no           no info value        delete
169  REPORT_DTE  Date/Time  8   no           no info value        delete
170  REPORT_TIM  Text       8   no           no info value        delete
171  PRIOR_JAN   Text       1   no           no info value        delete
172  REFUSED     Text       1   no           no info value        delete
173  NEGLECTED   Text       1   no           no info value        delete
174  EDS_VIEWED  Yes/No     1   no           no info value        delete
175  DCF_MISSIN  Text       1   no           no info value        delete
176  UIC         Text       8   attribute
177  PHASE       Text       1   no           no info value        delete
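
As a rough illustration of how the final column of the listing above could drive attribute selection, the following Python sketch keeps every field that is not marked "delete" and tallies the stated reasons for deletion. The tuple layout (field number, name, type, width, use, reason, action) and the function names are assumptions made for this example; only the six sample rows are taken from the listing.

```python
# Sample rows from the listing above. The column meanings are assumed:
# (field no., field name, type, width, use, reason, action).
ROWS = [
    (148, "HOME_ADDR1", "Text", 30, "no", "privacy act", "delete"),
    (151, "HOME_STATE", "Text", 2, "demographic", "", ""),
    (152, "HOME_ZIP", "Text", 5, "no", "info too specific", "delete"),
    (161, "SATISFIED", "Text", 1, "attribute ?", "question info value", ""),
    (176, "UIC", "Text", 8, "attribute", "", ""),
    (177, "PHASE", "Text", 1, "no", "no info value", "delete"),
]


def retained_fields(rows):
    """Return the names of fields whose action column is not 'delete'."""
    return [row[1] for row in rows if row[6] != "delete"]


def deletions_by_reason(rows):
    """Count deleted fields per stated reason."""
    counts = {}
    for row in rows:
        reason, action = row[5], row[6]
        if action == "delete":
            counts[reason] = counts.get(reason, 0) + 1
    return counts


if __name__ == "__main__":
    print(retained_fields(ROWS))
    print(deletions_by_reason(ROWS))
```

Fields flagged with a question mark, such as SATISFIED, are retained by this sketch because their action column is empty; a stricter filter could treat them separately.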
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B. DATA COLLECTION METHODS 



This section is quoted directly from (CCEP, 1996, pp. 13-14).

Participants may enroll in the CCEP by calling a toll-free number (1-800-796-9699),
which provides information and referrals to individuals requesting medical evaluations or by 
contacting their local military medical treatment facility (MTF). All MHSS eligible beneficiaries 
are eligible for the CCEP. For eligibility in the CCEP, a PGW veteran (or dependent) must have 
been eligible for DoD health care in June 1994 or later. 

Once an individual is referred, the CCEP provides a two-phase, comprehensive medical 
evaluation, with Phase I being conducted at one of 184 local MTFs. Phase II (when required) is 
conducted at one of 14 regional medical centers (RMCs). The medical review includes questions 
about family history, health, occupation, and unique exposures in the Gulf War, as well as a
structured review of symptoms. 

Once a participant has completed the examination processes, copies of examination
results are forwarded to the CCEP Program Management Team (PMT), where they undergo 
quality assurance procedures, and the data are entered into the master CCEP database. 

Additionally, of those CCEP participants suffering chronic, debilitating symptoms, the 
DoD has established an SCC at Walter Reed Army Medical Center and will have a second center 
opening in mid 1996 at Wilford Hall Medical Center, Lackland AFB, Texas.

The data, which were initially entered into a relational database, were translated into a 
statistical format for this (CCEP Report on 18,598 Participants) report. Various validity checks
were conducted to ensure that the data were appropriate for interpretation. Statistical tests and
descriptive analyses were conducted on various categories of participants, including those in 
theater during the Persian Gulf War, their spouses, and their children. Moreover, the CCEP 
participants who were in theater were compared to the PGW population as a whole and were 
stratified by units to compare those units with higher CCEP participation to those units with
lower CCEP participation. Specific analyses concerning self-reported exposures,
physician-elicited symptoms, diagnoses, self-reported reproductive outcomes, self-reported lost
workdays, physical evaluation boards (PEBs), and program satisfaction were conducted.
Additionally, a comparative analysis with the NAMCS data was conducted using age, sex, race,
ethnicity, and diagnostic code variables to more closely match the CCEP population.
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APPENDIX B. DATA DICTIONARY OF SELECTED DaMI FILES 



Structure for table:     C:\RESEARCH\VFP\VFPDOCS\DAMISAMP.DBF
Number of data records:  170340
Date of last update:     08/04/96
Code Page:               1252

Field  Field Name  Type       Width  Dec  Index  Nulls
    1  RULE        Integer        4              No
    2  CF          Numeric        6    2         No
    3  CUMCF       Numeric        6    2         No
    4  GENERATN    Integer        4              No
    5  SERVICE     Character      3              No
    6  SMOKE_NOW   Character      3              No
    7  SMOKE_PAST  Character      3              No
    8  OIL_SMOKE   Character      3              No
    9  HEAT_SMOKE  Character      3              No
   10  PASS_SMOKE  Character      3              No
   11  DIESL_FUEL  Character      3              No
   12  CARC_PAINT  Character      3              No
   13  OTHR_PAINT  Character      3              No
   14  OTHR_SOLVE  Character      3              No
   15  URANIUM     Character      3              No
   16  MICROWAVES  Character      3              No
   17  PESTICIDES  Character      3              No
   18  NERVE_GAS   Character      3              No
   19  PYRIDOSTIG  Character      3              No
   20  MUSTRD_GAS  Character      3              No
   21  CONTM_FOOD  Character      3              No
   22  CONTM_WATR  Character      3              No
   23  NONAF_WATR  Character      3              No
   24  NONAF_FOOD  Character      3              No
   25  ANTHRAX     Character      3              No
   26  BOTULISM    Character      3              No
   27  MALARIA     Character      3              No
   28  ACT_COMBAT  Character      3              No
   29  WOUNDED     Character      3              No
   30  CASUALTIES  Character      3              No
   31  SCUD_ATTAC  Character      3              No
   32  CHEM_ALARM  Character      3              No
   33  PQ_PRIOR    Character      3              No
   34  PQ_AFTER    Character      3              No
   35  SEX         Character                     No
   36  RACE        Character                     No
   37  FATIG       Character                     No
   38  ABDOM       Character                     No
   39  BLEED       Character                     No
   40  DEPRE       Character                     No
   41  DIARR       Character                     No
   42  Don         Character                     No
   43  SHORT       Character                     No
   44  HAIRL       Character                     No
   45  HEADA       Character                     No
   46  JOINT       Character                     No
   47  MEMOR       Character                     No
   48  MUSCL       Character                     No
   49  RASH        Character                     No
   50  SLEEP       Character                     No
   51  WEIGH       Character                     No



Structure for table:     C:\RESEARCH\VFP\VFPDOCS\RULELIB.DBF
Number of data records:  5446
Date of last update:     08/04/96
Code Page:               1252

Field  Field Name  Type       Width  Dec  Nulls
    1  RULE_NUMBE  Numeric        8       No
    2  NO_TRUE_LH  Numeric        8       No
    3  NO_TRUE_RH  Numeric        8       No
    4  NO_TRUE_BO  Numeric        8       No
    5  NO_FALSE_B  Numeric        8       No
    6  STANDARD_C  Numeric        5    2  No
    7  REVERSE_CF  Numeric        5    2  No
    8  COMPLEX_CF  Numeric        5    2  No
    9  VCOMPLEX    Numeric        5    2  No
   10  LHS_TEXT    Character    100       No
   11  RHS_TEXT    Character    100       No
   12  RHS_VERB    Character    150       No
   13  REF_NUM     Integer        4       No
** Total **                     415



Index 



Desc 
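
The total of 415 reported for RULELIB.DBF can be checked against the listed field widths. The short Python sketch below assumes the usual DBF convention that each record carries one extra byte for the deletion flag in addition to its fields; the dictionary and function names are illustrative only, and the size estimate excludes the file header.

```python
# Field widths as listed for RULELIB.DBF above.
RULELIB_WIDTHS = {
    "RULE_NUMBE": 8,
    "NO_TRUE_LH": 8,
    "NO_TRUE_RH": 8,
    "NO_TRUE_BO": 8,
    "NO_FALSE_B": 8,
    "STANDARD_C": 5,
    "REVERSE_CF": 5,
    "COMPLEX_CF": 5,
    "VCOMPLEX": 5,
    "LHS_TEXT": 100,
    "RHS_TEXT": 100,
    "RHS_VERB": 150,
    "REF_NUM": 4,
}

# Assumption: one byte per record is reserved for the DBF deletion flag.
DELETION_FLAG_BYTES = 1


def record_length(widths):
    """Bytes per record: the deletion flag plus the sum of all field widths."""
    return DELETION_FLAG_BYTES + sum(widths.values())


def data_bytes(widths, n_records):
    """Approximate size of the record area, excluding the file header."""
    return record_length(widths) * n_records


if __name__ == "__main__":
    print(record_length(RULELIB_WIDTHS))
    print(data_bytes(RULELIB_WIDTHS, 5446))
```

Under that assumption the field widths sum to 414, and the flag byte accounts for the remaining byte of the reported total.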



APPENDIX C. TOP 100 HYPOTHESES DISCOVERED BY
EXPOSURE-TO-DIAGNOSIS AND EXPOSURE-TO-SYMPTOM STUDIES
NPS Data Mining Initiative (DaMI)

09/06/96 Detailed Hypothesis Report: Exposure-to-diagnosis Study 
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# 10 60 7,161 59 584 98.0% 1.0% 2.58 2.27 280.93742 
# 30 598 433 9 6,724 2.0% 2.0% 2.42 2.37 20.89844
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